TrEMBL
Name	TrEMBL
Released	1996
Producer	UniProt Consortium
Country	International
Discipline	Bioinformatics
Formats	FASTA, XML, RDF
Access	Public

Contents

History
Content and Scope
Data Annotation and Quality Control
Submission and Update Process
Access and Usage
Relationship to UniProtKB and Swiss-Prot

TrEMBL (Translated EMBL Nucleotide Sequence Data Library) is a database of protein sequences that are computationally translated from coding sequences produced by large-scale sequencing projects, annotated automatically, and maintained for integration with curated protein knowledgebases. It was created to bring the output of high-throughput sequencing initiatives and consortia into a repository suitable for downstream annotation, comparative analysis, and integration with curated resources.

History

TrEMBL was initiated in the mid-1990s amid rapid growth in sequencing driven by projects such as the Human Genome Project, the Drosophila Genome Project, and the Saccharomyces Genome Project, and by efforts at the Sanger Centre. Development occurred in parallel with work at institutions including the European Bioinformatics Institute, the Swiss Institute of Bioinformatics, and the European Molecular Biology Laboratory. Funding and coordination involved agencies and programs such as the European Commission, the Wellcome Trust, and national research organizations such as the National Institutes of Health. Later milestones include integration with databases managed by the Protein Information Resource and the National Center for Biotechnology Information, and collaborations with the Joint Genome Institute and sequencing centers at Washington University in St. Louis and the Broad Institute. Community standards from groups such as the Gene Ontology Consortium, GenBank submission protocols, and policy discussions at conferences including the Intelligent Systems for Molecular Biology meetings influenced TrEMBL’s scope and format.

Content and Scope

The resource aggregates translated coding sequences derived from repositories including GenBank, the EMBL Nucleotide Sequence Database, and DDBJ, as well as proteomes assembled by projects led by the J. Craig Venter Institute and laboratories at the Wellcome Sanger Institute. Entries encompass predicted open reading frames from model-organism efforts such as Caenorhabditis elegans and Mus musculus, environmental sequencing from initiatives like the Tara Oceans Expedition, and pathogen-focused projects exemplified by work on Mycobacterium tuberculosis and Influenza A virus. Sequence formats and cross-references align with standards promoted by the International Nucleotide Sequence Database Collaboration. The scope also covers bacteria catalogued by the Human Microbiome Project, archaeal genomes sequenced by groups at the Max Planck Institute for Marine Microbiology, and metagenomes deposited by the Joint Genome Institute.
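To make "translated coding sequences" concrete, the following is a minimal sketch of translating a nucleotide coding sequence into a protein sequence. It assumes the standard genetic code (NCBI translation table 1) and reading frame 0; the function name `translate_cds` is hypothetical, and a production pipeline would also handle organism-specific genetic codes, ambiguity codes, and annotated frame information.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1).
# Codons are enumerated with bases in the order T, C, A, G,
# first base varying slowest and third base fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence in frame 0 (hypothetical helper).

    Translation stops at the first stop codon. Codons containing
    characters outside T/C/A/G are reported as 'X'. A trailing
    partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    usable_length = len(cds) - len(cds) % 3
    for i in range(0, usable_length, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTGGTAA"))
```

The codon table is built from the compact 64-character string used in NCBI's genetic code listings, which keeps the sketch short and makes it easy to swap in another translation table.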

Data Annotation and Quality Control

Automated annotation pipelines for the database employ search tools such as BLAST and HMMER together with signature and profile libraries such as Pfam, InterPro, and PROSITE for functional inference. Cross-references draw on curated resources maintained by the Protein Data Bank and the Enzyme Commission, on nomenclature managed by committees such as the HUGO Gene Nomenclature Committee, and on organism-specific databases including FlyBase and WormBase. Quality control workflows adopt ontologies and controlled vocabularies from the Gene Ontology Consortium and use taxonomic standards from the NCBI Taxonomy database and the Catalogue of Life. Error detection, redundancy checks, and annotation flags follow practices of the UniProt Consortium and are informed by discussions at meetings organized by the International Society for Biocuration.
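One simple form of the redundancy check mentioned above is to group entries whose sequences are byte-for-byte identical after normalization. The sketch below is an illustration only: it uses SHA-256 as a stand-in digest, the function names and sample identifiers are hypothetical, and nothing here is taken from the database's actual pipeline, which may also consider organism and other metadata.

```python
import hashlib
from collections import defaultdict
from typing import Iterable, List, Tuple


def sequence_digest(seq: str) -> str:
    """Return a digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def group_identical(records: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers that share an identical sequence.

    `records` yields (identifier, sequence) pairs. Only groups with
    more than one member are returned, in first-seen order.
    """
    groups = defaultdict(list)
    for identifier, seq in records:
        groups[sequence_digest(seq)].append(identifier)
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    # Hypothetical identifiers and toy sequences, for illustration only.
    sample = [("entry_a", "MKTAYIAK"), ("entry_b", "mktayiak\n"), ("entry_c", "MKTAYIAR")]
    print(group_identical(sample))
```

Hashing keeps memory use proportional to the number of distinct sequences rather than their total length, at the cost of detecting only exact duplicates and not near-identical fragments.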

Submission and Update Process

Sequence sources include primary data providers such as GenBank, direct deposits from sequencing centers at the Broad Institute and the Wellcome Trust Sanger Institute, and assemblies from translational research groups at institutions such as MIT and Harvard Medical School. Update cycles are synchronized with releases of the European Nucleotide Archive, and identifiers are harmonized with those used by databases including the Protein Data Bank and RefSeq. Versioning and accessioning adhere to conventions compatible with registries operated by the International Nucleotide Sequence Database Collaboration, with provenance metadata recording submitters such as the Max Planck Society and the Chinese Academy of Sciences.
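As an illustration of accession conventions, the sketch below checks whether a string has the shape of a UniProtKB accession number. It assumes the regular expression published in UniProt's help pages for accession numbers; the helper name `is_valid_accession` is hypothetical, and a match only confirms the format, not that the entry exists.

```python
import re

# Pattern for UniProtKB accession numbers as given in UniProt's help pages
# (6-character and 10-character forms). Assumed here, not derived.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text: str) -> bool:
    """Return True if `text` is shaped like a UniProtKB accession (hypothetical helper)."""
    return ACCESSION_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["P12345", "A0A024R161", "p12345", "12345"]:
        print(candidate, is_valid_accession(candidate))
```

Using `fullmatch` rather than `search` avoids accepting longer strings that merely contain a valid accession as a substring.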

Access and Usage

Public access is provided via portals maintained by the European Bioinformatics Institute and mirrored by services at the Swiss Institute of Bioinformatics and the National Center for Biotechnology Information. Options for bulk and programmatic access include download mirrors, FTP services hosted by the European Molecular Biology Laboratory, and APIs used by tools developed at institutions like EMBL-EBI and by commercial platforms at companies such as Illumina and Thermo Fisher Scientific. Usage spans comparative genomics in labs at Stanford University and the University of California, San Diego, proteomics pipelines at the Max Planck Institute of Biochemistry, and drug-target discovery efforts in collaboration with pharmaceutical groups at GlaxoSmithKline and Roche.
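For downloaded FASTA files, a small reader is often enough to separate automatically annotated entries from reviewed ones. The sketch below assumes the UniProtKB FASTA header layout `>db|accession|entry_name description`, where `db` is `tr` for TrEMBL and `sp` for Swiss-Prot. The class and function names are hypothetical, and the sample records are invented for illustration rather than real entries.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Header(NamedTuple):
    reviewed: bool      # True for 'sp' (Swiss-Prot), False for 'tr' (TrEMBL)
    accession: str
    entry_name: str
    description: str


HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")


def parse_header(line: str) -> Header:
    """Parse one UniProtKB-style FASTA header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized FASTA header: {line!r}")
    db, accession, entry_name, description = match.groups()
    return Header(db == "sp", accession, entry_name, description)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[Header, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header: Optional[Header] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = parse_header(line), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Invented records for illustration; not real database entries.
    sample = [
        ">tr|A0A000X000|A0A000X000_EXMPL Uncharacterized protein OS=Example organism OX=0 PE=4 SV=1",
        "MKTAYIAK",
        "QRQISFVK",
        ">sp|P00000|DEMO_EXMPL Demo protein OS=Example organism OX=0 PE=1 SV=2",
        "MSDNE",
    ]
    for header, seq in read_fasta(sample):
        status = "reviewed" if header.reviewed else "unreviewed"
        print(header.accession, status, len(seq))
```

Because `read_fasta` accepts any iterable of lines, it can be fed an open file handle directly, which avoids loading a large download into memory.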

Relationship to UniProtKB and Swiss-Prot

Within UniProtKB, TrEMBL is the automatically annotated, unreviewed section, and it serves as the staging area from which entries are promoted into the manually reviewed Swiss-Prot section. Both sections are maintained by the UniProt Consortium and curated at centers including the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Integration workflows rely on UniProt curation teams and cross-reference authoritative resources such as PDB, the Gene Ontology Consortium, and organism databases such as FlyBase and WormBase. The transition of entries from the automated repository into curated records follows expert review practices established by curators at institutions like the European Bioinformatics Institute and is coordinated with policy frameworks discussed at meetings of the International Society for Biocuration.

Category:Biological databases