LLMpedia: The first transparent, open encyclopedia generated by LLMs

UniProt

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
UniProt
Name: UniProt
Type: Biological database
Scope: Protein sequence and functional information
Country: International
Established: 2002
Maintained by: UniProt Consortium

UniProt is a comprehensive resource for protein sequences and functional information, produced by an international consortium. It integrates curated knowledge about proteins from across the tree of life, including model organisms and pathogens, and supports research in biomedical, agricultural, and industrial domains. The resource combines sequence data, functional annotation, cross-references, and evidence attribution to support experimental design, comparative genomics, and computational analysis.

Overview

UniProt provides a centralized repository maintained jointly by the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). It links protein entries to sequence resources such as Ensembl, RefSeq, and GenBank, and to functional resources including Gene Ontology, Pfam, InterPro, and Reactome. Entries also carry cross-references to organism-specific resources such as the Saccharomyces Genome Database, WormBase, FlyBase, and Mouse Genome Informatics, and to translational resources such as OMIM and DrugBank. UniProt accession numbers serve as stable identifiers, which keeps entries interoperable with large-scale repositories such as the European Nucleotide Archive, the Protein Data Bank, and KEGG.
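Cross-referencing between resources depends on recognizing a UniProt accession number reliably. As a minimal sketch, the check below uses the accession pattern published in the UniProt help pages. The helper name `is_uniprot_accession` is hypothetical, and a syntactically valid accession is not guaranteed to exist in the database.

```python
import re

# Accession pattern as documented in the UniProt help pages.
# It matches 6-character accessions (e.g. P01308) and the longer
# 10-character form (e.g. A0A024R161).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True if `text` has the shape of a UniProt accession.

    This is a purely syntactic check; it does not query UniProt.
    """
    return ACCESSION_RE.fullmatch(text) is not None
```

A check of this kind is useful before submitting identifiers to a mapping or retrieval service, because entry names such as `INS_HUMAN` and lowercase strings are rejected early.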

History and Development

The project was established in 2002, when the European Bioinformatics Institute, the Swiss Institute of Bioinformatics, and the Protein Information Resource brought their protein database efforts, Swiss-Prot, TrEMBL, and PIR-PSD, under a single consortium. The arrangement resembles other international data-sharing collaborations such as the Human Genome Project and the International Nucleotide Sequence Database Collaboration. The three member institutions coordinate annotation standards, which have been influenced by the Gene Ontology Consortium and by the policies of funders such as the European Commission and the National Institutes of Health. Its development has run in parallel with efforts in structural biology, exemplified by the Protein Data Bank, and in genomics, exemplified by Ensembl releases. Community venues such as the Intelligent Systems for Molecular Biology meeting have also informed its controlled vocabularies, evidence codes, and data exchange formats.

Data Content and Structure

Entries comprise reviewed records (UniProtKB/Swiss-Prot), which are manually curated, and unreviewed records (UniProtKB/TrEMBL), which are produced by automated pipelines. Curated records integrate experimentally supported annotations from the literature, including journals such as Nature, Science, Cell, PNAS, and The EMBO Journal. Structural cross-references point to the Protein Data Bank, and domain annotations to databases such as Pfam and SMART. Functional annotations use controlled vocabularies from Gene Ontology and pathway mappings to Reactome and KEGG. Sequence provenance is traced to genome assemblies from Ensembl, transcript annotation projects such as GENCODE, and submission archives such as GenBank. Taxonomic metadata follow the NCBI Taxonomy, which UniProt exposes through its own taxonomy pages. Evidence tags record the source and type of support for each annotation, following community standards such as the Evidence and Conclusion Ontology (ECO) and practices discussed at meetings such as Genome Informatics.
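The entry structure described above is visible in the UniProtKB text (flat-file) format, where each line starts with a two-letter line code. The sketch below reads a few of those codes from a hand-written, abridged sample: `ID` (entry name and review status), `AC` (accessions), `OX` (taxonomy identifier), and `DR` (cross-references). The sample values are illustrative and not guaranteed to match the live record, and `parse_entry` is a hypothetical helper that ignores every other line code.

```python
from collections import defaultdict

# Hand-written, abridged sample in the layout of the UniProtKB text format.
SAMPLE_ENTRY = """\
ID   INS_HUMAN               Reviewed;         110 AA.
AC   P01308;
OX   NCBI_TaxID=9606;
DR   Pfam; PF00049; Insulin; 1.
DR   GO; GO:0005179; F:hormone activity; IDA:UniProtKB.
//
"""


def parse_entry(text):
    """Extract review status, accessions, taxon and cross-references."""
    record = {"accessions": [], "xrefs": defaultdict(list)}
    for line in text.splitlines():
        # Columns 1-2 hold the line code; the data start at column 6.
        code, payload = line[:2], line[5:].strip()
        if code == "ID":
            fields = payload.split()
            record["entry_name"] = fields[0]
            record["reviewed"] = fields[1].rstrip(";") == "Reviewed"
        elif code == "AC":
            record["accessions"] += [a.strip() for a in payload.split(";") if a.strip()]
        elif code == "OX":
            # e.g. "NCBI_TaxID=9606;", optionally followed by an evidence tag
            value = payload.split("=")[1].rstrip(";")
            record["taxon_id"] = int(value.split()[0])
        elif code == "DR":
            # First field is the database name, second the identifier.
            database, identifier = [p.strip() for p in payload.split(";")[:2]]
            record["xrefs"][database].append(identifier)
    return record


entry = parse_entry(SAMPLE_ENTRY)
```

For production use, an established parser such as the one in Biopython is likely a better choice than a hand-rolled reader; the point here is only to show how review status, taxonomy, and cross-references sit side by side in one record.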

Access and Tools

The resource is accessible via web portals operated by the European Bioinformatics Institute and programmatically through REST and FTP services. Tools for text search, sequence similarity search, and batch retrieval build on algorithms and services such as BLAST, HMMER, and CD-HIT, and visualization components link to viewers such as Jalview and to UCSC Genome Browser tracks. Annotation pipelines interoperate with platforms such as the Ensembl Variant Effect Predictor and InterProScan, and with workflow systems such as Galaxy and Nextflow. Training and outreach align with programs at institutions such as Cold Spring Harbor Laboratory and with courses funded by the Wellcome Trust. Programmatic use is supported by adherence to the FAIR principles and to the data-sharing policies of agencies such as the National Science Foundation.
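Programmatic retrieval typically means composing a query URL for the REST service and reading back a tabular response. The sketch below assumes the public search endpoint at `rest.uniprot.org` and its `query`, `format`, `fields`, and `size` parameters; the helper names and the default field list are illustrative choices, and the current API documentation should be checked before relying on them. The `fetch` function is defined but not called, so the example runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint of the UniProt REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fields=("accession", "id", "protein_name"),
                     fmt="tsv", size=25):
    """Compose a search URL; special characters are percent-encoded."""
    params = {
        "query": query,
        "format": fmt,
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch(url, timeout=30):
    """Download the response body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example: reviewed human entries, returned as tab-separated values.
url = build_search_url("reviewed:true AND organism_id:9606")
```

Calling `fetch(url)` would return the first page of results; larger result sets need paging or the bulk download services, which this sketch does not cover.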

Data Curation and Quality Control

Curation combines manual expert review with automated annotation. The automated side draws on machine learning approaches of the kind used in projects such as AlphaFold for structure prediction, and on pattern recognition methods from Pfam and InterPro. Manual curators extract evidence from primary literature in journals including the Journal of Biological Chemistry and Nucleic Acids Research, applying evidence codes influenced by the Gene Ontology Consortium and annotation guidelines similar to those used by GenBank submitters. Quality control includes redundancy checks, taxonomy validation against the NCBI Taxonomy, and reconciliation of cross-references with resources such as RefSeq and the PDB. Release cycles and versioning follow provenance norms exemplified by Ensembl and the Protein Data Bank, so that analyses can be reproduced against a specific release.
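One way to picture a redundancy check is to group records that carry the same sequence within the same taxon. The sketch below does this with a SHA-256 digest of the normalized sequence. It is an illustration under stated assumptions, not a description of the consortium's actual pipeline: the function name, the record layout, and the placeholder accessions are all hypothetical.

```python
import hashlib
from collections import defaultdict


def find_redundant(records):
    """Group accessions whose sequences are identical within one taxon.

    `records` is an iterable of (accession, taxon_id, sequence) tuples.
    Returns a list of accession lists, one per redundant group.
    """
    groups = defaultdict(list)
    for accession, taxon_id, sequence in records:
        # Normalize: drop whitespace and ignore letter case.
        normalized = "".join(sequence.split()).upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        groups[(taxon_id, digest)].append(accession)
    return [accs for accs in groups.values() if len(accs) > 1]


# Placeholder records; the accessions and sequences are made up.
records = [
    ("ENTRY_A", 9606, "MKTAYIAK"),
    ("ENTRY_B", 9606, "mktay iak"),   # same sequence, different formatting
    ("ENTRY_C", 10090, "MKTAYIAK"),   # same sequence, different taxon
    ("ENTRY_D", 9606, "MKTAYIAR"),    # differs in the last residue
]
redundant = find_redundant(records)
```

Keying on the taxon as well as the digest keeps identical sequences from different species apart; dropping the taxon from the key would instead collapse them into one group.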

Usage and Impact in Research

Researchers cite the resource in studies of model organisms such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, and Homo sapiens, and in research on pathogens including Mycobacterium tuberculosis, Plasmodium falciparum, and SARS-CoV-2. It underpins large-scale genomics projects such as ENCODE and the 1000 Genomes Project, proteomics initiatives at laboratories such as those of the Max Planck Institutes and the European Molecular Biology Laboratory, and systems biology modeling in consortia including the Human Cell Atlas. Translational applications link to pharmacology resources such as DrugBank and to clinical genetics resources such as ClinVar and OMIM. The dataset also feeds machine learning efforts in protein function prediction, influenced by work at organizations such as DeepMind, and informs standards-setting bodies including the Gene Ontology Consortium and community-driven annotation jamborees.

Category:Biological databases