Generated by Llama 3.3-70B

| GenBank | |
|---|---|
| Title | GenBank |
| Description | Nucleotide sequence database |
| Center | National Center for Biotechnology Information |
| Scope | Global |
| Released | 1982 |
| License | Public domain |
GenBank is a comprehensive public repository of nucleotide sequences and their protein translations. Maintained by the National Center for Biotechnology Information (NCBI), it is a core component of the International Nucleotide Sequence Database Collaboration (INSDC). This database serves as a foundational resource for research in genomics, molecular biology, and bioinformatics.
GenBank functions as the National Institutes of Health's primary genetic sequence database, an annotated collection of all publicly available DNA sequences. It is part of a tripartite international collaboration that also includes the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory's European Nucleotide Archive (ENA). The three partners exchange data daily, which keeps their collections synchronized. The resource is freely accessible and supports scientific work ranging from evolutionary biology to medical genetics.
The origins of this database trace back to the late 1970s, with early conceptual work by Walter Goad at the Los Alamos National Laboratory. It was formally established in 1982 under the leadership of the National Institute of General Medical Sciences (NIGMS). The formation of the International Nucleotide Sequence Database Collaboration in the 1980s standardized global data exchange. In 1992, oversight was transferred to the National Center for Biotechnology Information, which had been created in 1988. The database later served as a primary public repository for sequence data from the Human Genome Project.
The database contains sequence data from over 500,000 formally described species, ranging from viruses and bacteria to eukaryotes such as Homo sapiens. Each record includes the nucleotide sequence along with annotation such as coding sequences, mRNA boundaries, and sequence variations, including single-nucleotide polymorphisms. Annotations are supplemented with bibliographic references linking to PubMed and taxonomic data from the NCBI Taxonomy Database. Protein translations are linked to corresponding entries in resources such as RefSeq and UniProt.
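As a rough illustration of how a protein translation relates to an annotated coding sequence, the sketch below translates a DNA string using the standard genetic code. The function name `translate_cds` and the example sequence are hypothetical, and the sketch assumes the sequence starts in frame. Real records may specify a different translation table or codon start, which is ignored here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical three-codon sequence: ATG (M), GCC (A), TAA (stop).
print(translate_cds("ATGGCCTAA"))  # MA
```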
Researchers worldwide submit data using tools like BankIt or the standalone Sequin software. Submissions are processed by staff at the National Center for Biotechnology Information, who perform quality checks. Data can be accessed via the Entrez search and retrieval system, which integrates it with related databases such as PubMed Central and the Protein Data Bank. Programmatic access is available through the Entrez Programming Utilities (E-utilities), and bulk downloads are offered via File Transfer Protocol (FTP).
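Programmatic retrieval through E-utilities can be sketched as follows. This is a minimal example, not an official client: it builds an EFetch request for a nucleotide record in flat-file form. The helper names `build_efetch_url` and `fetch_record` are hypothetical, and the accession `U49845` is used only as an illustrative identifier. Production use should also follow NCBI's usage guidelines, for example on request rates and API keys.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# EFetch endpoint of the Entrez Programming Utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb"):
    """Build an EFetch URL for one record in the nucleotide (nuccore) database."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession number of the record
        "rettype": rettype,   # "gb" for GenBank flat file, "fasta" for FASTA
        "retmode": "text",    # plain-text response
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_record(accession):
    """Download a record as text (requires network access)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the request URL; call fetch_record() to download the record.
    print(build_efetch_url("U49845"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test without contacting the server.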
This resource is indispensable for comparative genomics, enabling studies of gene function and phylogenetic relationships. It underpins diagnostic test development, vaccine design, and forensic science. During outbreaks like the COVID-19 pandemic, it was critical for tracking SARS-CoV-2 variants. The data also fuels machine learning applications in computational biology and supports major initiatives like the Earth BioGenome Project. Its open-access policy has democratized genomic research globally.
Internally, records are represented in Abstract Syntax Notation One (ASN.1), which allows a precise description of biological information. For distribution, records are rendered in a structured flat-file format with defined fields such as LOCUS, DEFINITION, and ACCESSION. The data is stored in a relational database management system, and the same records are available through INSDC partner sites such as the European Bioinformatics Institute, which hosts ENA. GenBank interoperates with the Sequence Read Archive for raw data and with the BLAST suite for sequence similarity searching.
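The locus, definition, and accession fields of the flat-file format can be read with a few lines of code. The sketch below assumes the usual layout in which a keyword starts in the first column, its value starts in column 13, and continuation lines are indented. The function name `parse_header` is hypothetical, and the sample text is an abbreviated, illustrative record header rather than a complete entry. Established libraries such as Biopython provide full parsers and are a better choice for real work.

```python
SAMPLE = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1
"""


def parse_header(flatfile_text):
    """Extract the locus name, definition, and accession from a flat-file header."""
    fields = {}
    current = None
    for line in flatfile_text.splitlines():
        # Stop at the sequence section or the end-of-record marker.
        if line.startswith("ORIGIN") or line.startswith("//"):
            break
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword and not line.startswith(" "):
            # A new top-level keyword begins in the first column.
            current = keyword
            fields[current] = value
        elif current is not None and value:
            # Indented lines continue the previous top-level field.
            fields[current] += " " + value
    return {
        "locus": fields.get("LOCUS", "").split()[0] if fields.get("LOCUS") else "",
        "definition": fields.get("DEFINITION", ""),
        "accession": fields.get("ACCESSION", "").split()[0] if fields.get("ACCESSION") else "",
    }


header = parse_header(SAMPLE)
print(header["locus"], header["accession"])
print(header["definition"])
```

Note that this sketch folds indented sub-keywords into their parent field, which is acceptable for the three header fields shown but not for parsing the feature table.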