Pfam — LLMpedia

Pfam
Name	Pfam
Type	Biological database
Scope	Protein families and domains
Country	United Kingdom
Institution	Wellcome Sanger Institute; European Bioinformatics Institute
Established	1994
Format	Multiple sequence alignments; profile hidden Markov models

Contents

Overview
Database Structure and Content
Construction and Curation Methods
Applications and Uses
Access and Tools
History and Development

Pfam is a widely used protein family resource that organizes protein sequences into families and domains for comparative analysis and annotation. It combines curated alignments, profile hidden Markov models, and classification metadata to support research in molecular biology, genomics, and structural biology. Its users include researchers at the Wellcome Sanger Institute, the European Bioinformatics Institute, and universities, and its annotations have been applied in large-scale efforts such as the Human Genome Project and the 1000 Genomes Project.

Overview

Pfam catalogs protein families by grouping homologous protein regions into entries, each linked to curated alignments and statistical models. It serves communities working on model organisms such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana, as well as large-scale projects such as the Cancer Genome Atlas and the Human Proteome Project. Pfam interoperates with complementary resources including UniProt, the Protein Data Bank, InterPro, Ensembl, and NCBI RefSeq, which provide cross-references and anchor sequence annotations in experimental and computational evidence.

Database Structure and Content

Each Pfam entry consists of a seed alignment, a full alignment, and a profile hidden Markov model (HMM) that captures the conserved features of a protein family. The seed alignment is a small, curated set of representative sequences; the profile HMM is built from it; and the full alignment collects the sequences that the model detects in the underlying sequence database. Where available, entries link to structural data in repositories like the Protein Data Bank and to functional annotations from projects such as Gene Ontology and KEGG. The collection contains thousands of entries covering enzymatic families, signaling modules, transcription factors, and membrane proteins found across taxa including Bacteria, Archaea, Viruses, and eukaryotic clades like Chordata and Viridiplantae. Related resources and classifications include SCOP, CATH, SMART, TIGRFAMs, and the Conserved Domain Database (CDD).
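The three parts of an entry can be pictured with a small data structure. The sketch below is illustrative only: `FamilyEntry`, its fields, and the accession `PF99999` are hypothetical names, and the per-column residue frequencies it computes are a deliberately simplified stand-in for a profile HMM, which additionally models insertions, deletions, and transition probabilities.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FamilyEntry:
    """Hypothetical container mirroring the three parts of an entry."""
    accession: str                      # made-up identifier for illustration
    seed_alignment: List[str]           # curated rows, all the same length
    full_alignment: List[str] = field(default_factory=list)

    def column_profile(self) -> List[Dict[str, float]]:
        """Per-column residue frequencies of the seed, ignoring gaps ('-')."""
        profile = []
        for column in zip(*self.seed_alignment):
            counts = Counter(res for res in column if res != "-")
            total = sum(counts.values())
            if total:
                profile.append({res: n / total for res, n in counts.items()})
            else:
                profile.append({})  # column made up entirely of gaps
        return profile


# Toy seed alignment of three aligned rows (invented sequences).
entry = FamilyEntry("PF99999", ["MKVL", "MRVL", "M-IL"])
profile = entry.column_profile()
print(profile[1])  # frequencies for the second column
```

Fully conserved columns end up with a single residue at frequency 1.0, while variable columns spread their weight over several residues; that contrast is the kind of signal a real profile HMM encodes in its emission probabilities.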

Construction and Curation Methods

Pfam models are built by curators, supported by automated pipelines, starting from a manually curated seed alignment that is often assembled from experimentally characterized sequences and reference proteomes such as those of Homo sapiens, Mus musculus, Zea mays, and Saccharomyces cerevisiae. The seed alignment is used to build a profile HMM, and the family is then extended by iterative searches of sequence databases such as UniProtKB and GenBank with tools including HMMER and BLAST. Curation integrates evidence from structures in the Protein Data Bank, literature published in journals such as Nature, Science, and Proceedings of the National Academy of Sciences, and annotations from organizations such as the European Molecular Biology Laboratory and the Wellcome Trust. Quality control assesses alignment coverage, model specificity, and the resolution of overlaps between families, drawing on statistical frameworks developed by researchers at institutions including the University of Cambridge and Stanford University.
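Overlap resolution can be illustrated with a toy example. The sketch below is not Pfam's actual procedure: it assumes a simple greedy rule (keep the highest-scoring hit and discard any lower-scoring hit that overlaps one already kept), and the `Hit` type, `resolve_overlaps` function, family names, and scores are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str
    start: int    # 1-based, inclusive
    end: int      # inclusive
    score: float  # bit score; higher is better


def resolve_overlaps(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best-scoring hits that do not overlap each other."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # A hit is compatible if it lies entirely before or after every kept hit.
        if all(hit.end < k.start or hit.start > k.end for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Invented hits on one sequence: FAM_A and FAM_B overlap at positions 100-120.
hits = [
    Hit("FAM_A", 10, 120, 85.0),
    Hit("FAM_B", 100, 200, 40.5),
    Hit("FAM_C", 210, 300, 33.1),
]
resolved = resolve_overlaps(hits)
print([h.family for h in resolved])
```

Here the lower-scoring `FAM_B` is dropped because it overlaps `FAM_A`, while `FAM_C` is kept. A greedy rule is easy to reason about but is not guaranteed to maximize the total score of the retained hits.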

Applications and Uses

Pfam models are used for domain annotation in large-scale genome projects like the Human Genome Project and in pathogen surveillance initiatives such as GISAID and the Global Virome Project. Applications include inferring protein function for entries linked to pathways in KEGG, predicting domain architectures for proteins studied in labs at the Massachusetts Institute of Technology and the Max Planck Society, guiding mutagenesis experiments reported in journals such as Cell and The EMBO Journal, and enabling comparative analyses in evolutionary studies, including phylogenetic work at institutions like the Smithsonian Institution. Pfam-based annotations also support biotechnology and pharmaceutical research at companies and centers including GlaxoSmithKline and the Broad Institute.
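A domain architecture is simply the ordered series of domains found along a protein. As a minimal sketch, assuming domain hits are already available as (name, start, end) tuples, the architecture can be derived by sorting on start position. The function name, the `-` separator, and the domain names below are arbitrary choices made for illustration.

```python
from typing import List, Tuple


def domain_architecture(hits: List[Tuple[str, int, int]]) -> str:
    """Return domain names ordered from N- to C-terminus, joined by '-'."""
    ordered = sorted(hits, key=lambda hit: hit[1])  # sort by start coordinate
    return "-".join(name for name, _start, _end in ordered)


# Invented hits on one protein, listed out of order on purpose.
hits = [("DomB", 150, 240), ("DomA", 5, 120), ("DomB", 260, 350)]
architecture = domain_architecture(hits)
print(architecture)  # DomA-DomB-DomB
```

Proteins that share the same architecture string are natural candidates for grouping in comparative analyses, although repeated or nested domains may need more careful handling than this sketch provides.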

Access and Tools

Pfam data are accessible through a web interface hosted by organizations such as the European Bioinformatics Institute, with mirror servers at the Wellcome Sanger Institute. Local and programmatic analysis is supported by downloadable HMM libraries that can be searched with command-line tools such as HMMER, and Pfam is integrated into platforms including UniProt and Ensembl, into workflow systems used at the European Nucleotide Archive, and into research infrastructures like ELIXIR. Visualization and analysis tools interoperate with structural viewers that consume PDB coordinates and with multiple sequence alignment tools such as MAFFT and Clustal Omega, used by groups at the University of Oxford and ETH Zurich.
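For local searches, HMMER's `hmmscan` can write a per-domain table with its `--domtblout` option, which is convenient to post-process. The sketch below parses a few columns of that whitespace-delimited table in Python. The sample line is fabricated for illustration (including the accession `PF99999.1`), the file names in the comments are assumptions, and the column positions follow the layout described in the HMMER user guide.

```python
from typing import Dict, List

# Assumed prior steps on the command line (file names are placeholders):
#   hmmpress Pfam-A.hmm
#   hmmscan --domtblout hits.domtbl Pfam-A.hmm query.fasta


def parse_domtblout(text: str) -> List[Dict[str, object]]:
    """Extract selected fields from hmmscan --domtblout output."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 fixed columns, then a free-text description that may contain spaces.
        f = line.split(maxsplit=22)
        rows.append({
            "target": f[0],             # model (family) name
            "accession": f[1],          # model accession
            "query": f[3],              # query sequence name
            "i_evalue": float(f[12]),   # independent E-value of this domain
            "score": float(f[13]),      # bit score of this domain
            "ali_from": int(f[17]),     # alignment start on the query
            "ali_to": int(f[18]),       # alignment end on the query
        })
    return rows


# Fabricated example output: one header comment and one domain line.
sample = """# target name  accession  tlen  query name ...
ExampleDom PF99999.1 264 query1 - 350 1.2e-60 200.1 0.0 1 1 3.4e-64 1.5e-60 199.8 0.0 1 264 20 280 20 280 0.95 Hypothetical example domain
"""
rows = parse_domtblout(sample)
print(rows[0]["accession"], rows[0]["ali_from"], rows[0]["ali_to"])
```

Splitting with `maxsplit=22` keeps the trailing description intact as a single field, which matters because descriptions routinely contain spaces.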

History and Development

Pfam originated in the mid-1990s amid efforts to systematize protein domain annotation, alongside initiatives such as the Human Genome Project and the growth of protein sequence repositories such as Swiss-Prot, a predecessor of UniProt. Early development involved collaborations between research groups at institutions including the Wellcome Trust Sanger Institute (now the Wellcome Sanger Institute) and the European Bioinformatics Institute, and drew on algorithmic advances from teams at the European Molecular Biology Laboratory and at universities such as the University of Washington and the University of California, San Diego. Over successive releases Pfam expanded the coverage of its curated Pfam-A families, helped by community annotation efforts tied to conferences and meetings including ISMB and EMBO meetings. The resource continues to evolve with contributions from communities at universities, national laboratories, and biomedical institutes, and with support from funders including the National Institutes of Health and the Wellcome Trust.

Category:Biological databases