UniProtKB — LLMpedia

UniProtKB
Name	UniProtKB
Author	European Bioinformatics Institute; Swiss Institute of Bioinformatics; Protein Information Resource
Released	2002
Access	free

Contents

Overview
History and Development
Database Structure and Content
Annotation and Curation Processes
Access, Tools, and Services
Usage and Applications
Governance and Funding Sources

UniProtKB (the UniProt Knowledgebase) is a comprehensive, curated resource of protein sequences and functional information maintained by the international UniProt Consortium. It integrates protein sequence data, functional annotations, cross-references and nomenclature to support research in molecular biology, genomics, proteomics and the biomedical sciences. The resource is used worldwide by researchers at institutions such as European Molecular Biology Laboratory, National Institutes of Health, Max Planck Society, University of Cambridge, and Harvard University.

Overview

UniProtKB provides a central repository that combines manually reviewed entries with computationally annotated records. It is produced by European Bioinformatics Institute, Swiss Institute of Bioinformatics, and Protein Information Resource, and serves research communities at institutions such as Wellcome Sanger Institute and Broad Institute. The resource cross-references major databases such as Protein Data Bank, Gene Ontology Consortium, Ensembl, RefSeq, and Pfam, and supports standards developed by organizations like International Nucleotide Sequence Database Collaboration and World Wide Web Consortium. Each entry supplies an accession number, curated protein names, organism information linked to sources such as NCBI Taxonomy, and cross-references to pathway databases including KEGG and Reactome.

History and Development

UniProtKB originated in collaborative efforts between groups at European Molecular Biology Laboratory, Swiss Institute of Bioinformatics, and Protein Information Resource, in response to the flood of sequence data that followed milestones such as the Human Genome Project and sequencing initiatives led by J. Craig Venter and Francis Collins. Key development phases were shaped by advances at institutions like EMBL-EBI and by funding from agencies including European Commission and National Human Genome Research Institute. The knowledgebase was formed by unifying the Swiss-Prot, TrEMBL, and PIR-PSD protein databases, a community-driven consolidation akin to the efforts that produced repositories like GenBank and EMBL Nucleotide Sequence Database.

Database Structure and Content

The knowledgebase is partitioned into a reviewed section (Swiss-Prot, manually annotated) and an unreviewed section (TrEMBL, computationally annotated). Each entry carries a stable accession number together with entry and sequence version numbers, and its organism is tied to taxonomy references such as NCBI Taxonomy, with viral names aligned to International Committee on Taxonomy of Viruses nomenclature. Entries include sequence features, domain annotations linked to InterPro, post-translational modifications cross-referenced to databases like PhosphoSitePlus, and experimental evidence tied to publications in journals such as Nature, Science, Cell, Journal of Biological Chemistry, and Proceedings of the National Academy of Sciences. Cross-references to model organism resources like Saccharomyces Genome Database, Mouse Genome Informatics, WormBase, and FlyBase enable mapping between protein entries and genetic loci.
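The split between reviewed and unreviewed entries is visible in the FASTA headers that UniProtKB distributes: reviewed entries are prefixed `sp` and unreviewed entries `tr`, followed by the accession and entry name. The following is a minimal sketch of reading such a header, assuming the documented layout `>db|Accession|EntryName ProteinName OS=... OX=... GN=... PE=... SV=...` and the accession pattern published in the UniProt help pages. The function name `parse_header` and the returned dictionary keys are illustrative choices, not part of any UniProt library, and isoform accessions with a `-N` suffix are not handled.

```python
import re

# Accession pattern as published in the UniProt help pages (canonical accessions only).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# ">sp|P05067|A4_HUMAN free text": db is "sp" (reviewed) or "tr" (unreviewed).
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")

# KEY=value pairs in the free-text part; a value runs until the next KEY= or end of line.
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=(.*?)(?=\s(?:OS|OX|GN|PE|SV)=|$)")


def parse_header(line):
    """Split one UniProtKB FASTA header line into a dictionary (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a UniProtKB FASTA header")
    db, accession, entry_name, rest = match.groups()
    if ACCESSION_RE.fullmatch(accession) is None:
        raise ValueError(f"malformed accession: {accession}")
    fields = dict(FIELD_RE.findall(rest))
    return {
        "reviewed": db == "sp",
        "accession": accession,
        "entry_name": entry_name,
        # Everything before the first " OS=" is the protein name.
        "protein_name": rest.split(" OS=", 1)[0],
        "organism": fields.get("OS"),
        "taxon_id": fields.get("OX"),
        "gene": fields.get("GN"),
        "sequence_version": fields.get("SV"),
    }
```

Calling `parse_header` on a header beginning `>sp|P05067|A4_HUMAN` would mark the entry as reviewed and expose its taxon identifier and sequence version, which is usually enough to route reviewed and unreviewed records into separate downstream steps.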

Annotation and Curation Processes

Manual curation is performed by expert curators who follow standards promoted by bodies such as Gene Ontology Consortium and draw on literature from publishers including Elsevier, Springer Nature, Wiley, and American Society for Microbiology. Computational annotation pipelines rely on hidden Markov model methods from groups such as those at EMBL-EBI, pattern repositories such as PROSITE, and clustering resources such as UniRef and OrthoDB. Quality control ties each annotation to peer-reviewed evidence indexed in databases like PubMed and to persistent identifiers such as Digital Object Identifiers, which keeps annotations traceable to their source.

Access, Tools, and Services

The resource is distributed through web interfaces and programmatic endpoints used by platforms such as Galaxy, Cytoscape, and BLAST, as well as by the UniProt Consortium's own tools. Bulk downloads support pipelines used at centers like Genomics England, European Genome-phenome Archive, and National Center for Biotechnology Information. Visualization components connect to viewers developed in collaboration with projects funded by European Research Council and incorporate ontologies from OBO Foundry members. Educational and training activities are coordinated with partners like FAIRsharing and institutions such as Cold Spring Harbor Laboratory.
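As an illustration of programmatic access, the sketch below builds request URLs for the public UniProt REST service at `https://rest.uniprot.org`. It assumes the `/uniprotkb/{accession}.{format}` and `/uniprotkb/search` endpoints with the `query`, `fields`, `format`, and `size` parameters; the helper names `entry_url`, `search_url`, and `fetch_text` are hypothetical, and the current API documentation should be checked before relying on specific field names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base path of the UniProtKB REST endpoints (assumed; check the current API docs).
BASE = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, fmt="fasta"):
    """URL for a single entry, e.g. in FASTA, JSON or plain-text format."""
    return f"{BASE}/{accession}.{fmt}"


def search_url(query, fields=("accession", "id", "protein_name"), size=25, fmt="tsv"):
    """URL for a search; `fields` selects the columns returned for tabular formats."""
    params = {
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
        "size": size,
    }
    return f"{BASE}/search?{urlencode(params)}"


def fetch_text(url, timeout=30):
    """Download a URL and return the body as text (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_text(search_url("gene:APP AND organism_id:9606 AND reviewed:true"))` would request a tab-separated table of reviewed human entries for the gene APP. Keeping URL construction separate from the download makes the query logic easy to check without network access.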

Usage and Applications

Researchers in structural biology follow cross-references to Protein Data Bank entries and integrate UniProtKB data into studies led by groups at Max Planck Institute for Biophysical Chemistry, Riken, and Scripps Research. Clinical variant interpretation pipelines that draw on resources like ClinVar, and work at organizations such as European Medicines Agency, use the curated annotations, which also inform drug target validation in programs supported by Bill & Melinda Gates Foundation or run in collaboration with companies such as Genentech, Pfizer, and Roche. Comparative genomics projects at Broad Institute and ecological metaproteomics studies linked to Joint Genome Institute use UniProtKB identifiers to aggregate functional profiles across datasets.

Governance and Funding Sources

Governance is provided by an international consortium of European Bioinformatics Institute, Swiss Institute of Bioinformatics, and Protein Information Resource, with oversight mechanisms aligned with the policies of funders such as European Commission, Wellcome Trust, and National Institutes of Health, and of philanthropic organizations like Gordon and Betty Moore Foundation. Core infrastructure funding and project grants have been awarded through competitive programs such as Horizon 2020 and by funders such as UK Research and Innovation and national research councils like Deutsche Forschungsgemeinschaft and Agence Nationale de la Recherche.
