InterPro — LLMpedia

InterPro
Name	InterPro
Established	1999
Type	Biological database
Focus	Protein families, domains, functional sites
Country	International
Access	Public
License	Various (member databases)

Contents

Overview
Data and Classification Methods
Database Content and InterPro Signatures
Tools and Access (Web, APIs, and Software)
Applications and Use Cases
History and Development

InterPro is a bioinformatics resource that integrates predictive models, signatures, and annotations to classify protein sequences into families, domains, and functional sites. It aggregates data from multiple member databases and provides unified protein annotations used by researchers, curators, and large-scale projects in genomics and proteomics. InterPro supports sequence analysis in projects associated with institutions such as the European Bioinformatics Institute, the National Center for Biotechnology Information, and the Wellcome Trust Sanger Institute, and with consortia such as UniProt, the Gene Ontology Consortium, and Ensembl.

Overview

InterPro functions as a composite resource combining models from specialized resources including Pfam, PROSITE, PRINTS, SUPERFAMILY, SMART, PANTHER, and TIGRFAMs. By unifying signature descriptions from these member databases, InterPro produces consolidated entries that describe protein families, domains, repeats, and conserved sites. Major users and contributors include projects at the European Molecular Biology Laboratory and the Genome Analysis Centre, and national databases such as GenBank and RefSeq. InterPro annotations are frequently propagated to resources such as UniProtKB and Ensembl Genomes, and to comparative platforms used in studies by groups with Wellcome Trust funding and at institutes such as those of the Max Planck Society.

Data and Classification Methods

InterPro assembles predictive models (hidden Markov models, regular-expression patterns, and position-specific scoring matrices, or profiles) compiled by member resources, including groups that build models with HMMER and motif databases such as PROSITE. Classification relies on hierarchical curation to map signatures to InterPro entries, creating family hierarchies and domain architectures linked to ontology terms from the Gene Ontology Consortium and cross-referenced to external resources such as the NCBI Conserved Domain Database and the Protein Data Bank. Automated pipelines integrate algorithms developed at organizations such as EMBL-EBI and by research groups contributing tools such as BLAST and HMMER3. Curatorial review by experts at institutes such as the European Bioinformatics Institute and at universities keeps entries consistent with standards advocated by bodies such as the Open Bioinformatics Foundation.
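As an illustration of the simplest kind of signature mentioned above, the following minimal sketch translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names (`prosite_to_regex`, `find_matches`), the demonstration pattern, and the toy sequence are assumptions made for this example; the sketch handles only the basic pattern syntax and is not the scanning code used by PROSITE or InterPro.

```python
import re

# One pattern element: a residue, an allowed set [..], or a forbidden set {..},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(
    r"(?P<core>[A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a regex string."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        parsed = _ELEMENT.fullmatch(element)
        if parsed is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = parsed["core"]
        if core == "x":                  # x means any residue
            core = "."
        elif core.startswith("{"):       # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if parsed["lo"] is not None:
            upper = "," + parsed["hi"] if parsed["hi"] else ""
            repeat = "{" + parsed["lo"] + upper + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


def find_matches(pattern, sequence):
    """Return 1-based inclusive (start, end) pairs of non-overlapping hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence.upper())]


# Illustrative zinc-finger-style pattern and a made-up sequence.
DEMO_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
DEMO_SEQUENCE = "MKCAACAAALAAAAAAAAHAAAHGG"

if __name__ == "__main__":
    print(prosite_to_regex(DEMO_PATTERN))
    print(find_matches(DEMO_PATTERN, DEMO_SEQUENCE))
```

Because `re.finditer` reports only non-overlapping hits, a production scanner that needs every overlapping occurrence would require a different search loop. Hidden Markov models and profiles score sequences probabilistically and cannot be reduced to a regular expression in this way.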

Database Content and InterPro Signatures

Content includes curated entries for protein families, domains, repeats, and sites, each associated with one or more member-database signatures (InterPro signatures). Signatures originate from contributors including the Pfam Consortium, the PANTHER Consortium, the TIGRFAMs Consortium, and independent laboratories linked to institutes such as the Sanger Institute and the University of Cambridge. Entries are annotated with descriptive text, Gene Ontology terms, and literature pointers to journals such as Nature, Science, and Nucleic Acids Research. Cross-references span databases and resources including UniProtKB, Ensembl, RefSeq, and KEGG, with structural links to Protein Data Bank entries, enabling integration with pathway resources curated by teams at the European Molecular Biology Laboratory and by international collaborators.
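The relationship between member-database signatures and entries can be sketched as a lookup table plus a grouping step. In the minimal example below, every accession (`PF90001`, `IPR900001`, and so on) is made up for illustration, and merging overlapping locations is a simplification chosen for this sketch rather than a description of how InterPro itself reports matches.

```python
from collections import defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    """A hit of one member-database signature on a protein (1-based, inclusive)."""
    signature: str
    start: int
    end: int


# Hypothetical signature-to-entry mapping; none of these accessions are real.
SIGNATURE_TO_ENTRY = {
    "PF90001": "IPR900001",
    "SM90001": "IPR900001",
    "PS90001": "IPR900002",
}


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into sorted, disjoint pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consolidate(matches, signature_to_entry):
    """Group signature hits by entry; return (entry locations, unmapped hits)."""
    by_entry = defaultdict(list)
    unmapped = []
    for match in matches:
        entry = signature_to_entry.get(match.signature)
        if entry is None:
            unmapped.append(match)   # signature has no entry in the mapping
        else:
            by_entry[entry].append((match.start, match.end))
    locations = {entry: merge_intervals(spans) for entry, spans in by_entry.items()}
    return locations, unmapped


DEMO_MATCHES = [
    Match("PF90001", 10, 120),
    Match("SM90001", 15, 130),
    Match("PS90001", 200, 210),
    Match("XX00000", 300, 320),
]

if __name__ == "__main__":
    entry_locations, leftovers = consolidate(DEMO_MATCHES, SIGNATURE_TO_ENTRY)
    print(entry_locations)
    print(leftovers)
```

Keeping the mapping separate from the matches mirrors the idea that several signatures from different member databases can describe the same family or domain, while hits without a mapping are returned separately instead of being dropped.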

Tools and Access (Web, APIs, and Software)

InterPro data are accessible via a web portal maintained at the European Bioinformatics Institute and through programmatic interfaces used by platforms such as UniProt and Ensembl and by genome browsers developed at the Broad Institute. Programmatic access includes RESTful APIs and data dumps consumed by pipeline tools including HMMER and EMBOSS, and by workflow managers used in projects at the Wellcome Sanger Institute and at universities such as the University of Oxford. Visualization and analysis are supported by third-party software from communities such as the Galaxy Project and Bioconductor and from developers associated with the Open Bioinformatics Foundation. Commercial and academic users integrate InterPro annotations into annotation pipelines, including those used in initiatives funded by bodies such as the European Research Council.
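A client for the RESTful interface might look like the minimal sketch below. The base URL, the path layout (`/entry/interpro/protein/uniprot/<accession>`), and the JSON shape with `results` and `metadata` keys are assumptions about the public API and should be checked against the current InterPro documentation. `SAMPLE_PAYLOAD` is hand-written for illustration with made-up accessions and is not real API output.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the InterPro REST API; verify before relying on it.
API_ROOT = "https://www.ebi.ac.uk/interpro/api"


def entries_for_protein_url(uniprot_accession):
    """Build the (assumed) URL listing InterPro entries matching one protein."""
    return f"{API_ROOT}/entry/interpro/protein/uniprot/{quote(uniprot_accession)}"


def fetch_json(url, timeout=30.0):
    """Perform a GET request and decode the JSON body (requires network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_entries(payload):
    """Extract (accession, type) pairs from a payload of the assumed shape."""
    rows = []
    for result in payload.get("results", []):
        metadata = result.get("metadata", {})
        rows.append((metadata.get("accession", ""), metadata.get("type", "")))
    return rows


# Hand-written sample in the assumed response shape; accessions are made up.
SAMPLE_PAYLOAD = {
    "count": 2,
    "results": [
        {"metadata": {"accession": "IPR900001", "type": "domain"}},
        {"metadata": {"accession": "IPR900002", "type": "family"}},
    ],
}

if __name__ == "__main__":
    print(entries_for_protein_url("P04637"))
    print(summarize_entries(SAMPLE_PAYLOAD))
    # To query the live service instead of the sample, uncomment:
    # print(summarize_entries(fetch_json(entries_for_protein_url("P04637"))))
```

Separating URL construction, transport, and parsing keeps the parsing step testable without network access, which matters in pipelines that process data dumps offline as well as live API responses.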

Applications and Use Cases

Researchers apply InterPro annotations in the functional annotation of genomes sequenced by consortia such as the 1000 Genomes Project and the Human Genome Project, in comparative genomics studies from groups at Max Planck Institutes and Stanford University, and in metagenomics analyses in projects linked to the European Molecular Biology Laboratory and environmental initiatives. Clinical and translational research teams at institutions such as the Broad Institute and Massachusetts General Hospital use InterPro-informed predictions to interpret variants in proteins implicated in diseases reported in journals such as The Lancet and the New England Journal of Medicine. Structural biologists cross-reference InterPro entries with Protein Data Bank structures to map domains for hypothesis-driven mutagenesis in labs at Cold Spring Harbor Laboratory and the Salk Institute. Agricultural genomics projects at institutes such as the John Innes Centre and INRAE use InterPro for crop gene family annotation.

History and Development

InterPro originated in the late 1990s from collaborative efforts among member databases seeking to unify protein signature resources, with early coordination involving institutions such as the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Member databases and consortia, including Pfam, PROSITE, PRINTS, SMART, and PANTHER, expanded coverage over time and established links to ontologies maintained by the Gene Ontology Consortium. Development milestones include the integration of automated pipelines built on tools such as HMMER3 and the adoption of web services used by projects at UniProt and Ensembl. Curation and collaboration continue through networks of contributors across universities and research institutes, supported by funding agencies such as the Wellcome Trust and the European Research Council.
