GenBank — LLMpedia

GenBank
Name	GenBank
Developer	National Center for Biotechnology Information
Released	1982
Genre	Nucleotide sequence database
License	Public domain

Contents

Overview
History
Data content and organization
Data submission and retrieval
Relationship to other databases
Impact and use in research

GenBank is a comprehensive public database of nucleotide sequences and associated biological information, maintained by the National Center for Biotechnology Information (NCBI) within the National Library of Medicine (NLM), part of the National Institutes of Health (NIH). As a foundational resource for bioinformatics, it archives and freely distributes nucleotide sequences from more than 500,000 formally described species, supporting research in genomics, molecular biology, and evolutionary biology. The database operates as part of the International Nucleotide Sequence Database Collaboration (INSDC), which keeps its data synchronized with partners at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and the DNA Data Bank of Japan (DDBJ).

Overview

GenBank serves as the NIH's primary repository for genetic sequence data submitted directly by individual researchers and large-scale sequencing centers, including those involved in major projects like the Human Genome Project. Each record represents a single, contiguous stretch of DNA or RNA with supporting annotations, such as protein-coding regions and references to published literature. The resource is integral to modern biological discovery: scientists compare newly determined sequences with those from other organisms to reveal evolutionary relationships, identify genes, and predict gene function. Access to the database is provided through the Entrez retrieval system, which links sequences to related data in other NCBI resources such as PubMed and the Protein database.

History

The origins of GenBank trace back to the late 1970s, with early work on a sequence database by Walter Goad at the Los Alamos National Laboratory. The database was officially established in 1982 through an NIH contract to Bolt, Beranek and Newman and was initially managed at Los Alamos. Its stewardship was transferred in 1992 to the National Center for Biotechnology Information, which had been created in 1988. The creation of NCBI, championed by figures such as Representative Claude Pepper and James B. Wyngaarden, then director of the NIH, was a response to the growing volume of data produced by advances in DNA sequencing technology. Throughout the 1990s and 2000s, the exponential growth of data from initiatives like the Human Genome Project and the rise of next-generation sequencing transformed the scale and importance of the database.

Data content and organization

The core data within GenBank are sequence records, each identified by a unique accession number that provides a stable identifier for citation and retrieval. Records are organized into divisions, such as PRI for primate sequences or BCT for bacterial sequences, and include detailed annotations like the coding sequence (CDS) features that define protein products. These annotations are often linked to other databases, such as the RefSeq project, which provides curated reference sequences, and the Taxonomy Database, which provides standardized taxonomic classifications. Records are distributed in the GenBank flat file format, whose annotations follow the INSDC feature table definition, and the bare sequences can also be exported in the simpler FASTA format. This keeps the data compatible with the wide array of bioinformatics software used by the global research community.
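As a rough illustration of the record structure described above, the following Python sketch reads the division code, accession number, feature locations, and sequence from a single flat-file record and re-emits the sequence as FASTA. The record is a made-up example (the accession `XX000001` and the sequence are hypothetical), and the parser is deliberately minimal; real records contain many more fields and multi-line locations, so an established library is the safer choice in practice.

```python
import re

# Hypothetical record for illustration only; the accession and sequence are made up.
SAMPLE_RECORD = """\
LOCUS       XX000001                  30 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000001
VERSION     XX000001.1
FEATURES             Location/Qualifiers
     source          1..30
                     /organism="Homo sapiens"
     CDS             4..21
                     /product="hypothetical protein"
ORIGIN
        1 gcgatggcta aaggtccgtg acgtacgtac
//
"""


def parse_record(text):
    """Extract a few fields from one GenBank-style flat-file record."""
    info = {"features": []}
    seq_chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            tokens = line.split()
            info["length"] = int(tokens[2])   # sequence length in bp
            info["division"] = tokens[-2]     # e.g. PRI or BCT
        elif line.startswith("ACCESSION"):
            info["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True                  # sequence lines follow
        elif line.startswith("//"):
            break                             # end-of-record marker
        elif in_origin:
            # Drop the position counter and spaces, keep only the bases.
            seq_chunks.append(re.sub(r"[^a-zA-Z]", "", line))
        else:
            # Feature keys sit at column 6; qualifier lines are indented further.
            match = re.match(r"^ {5}(\S+)\s+(\S+)$", line)
            if match:
                info["features"].append((match.group(1), match.group(2)))
    info["sequence"] = "".join(seq_chunks)
    return info


def to_fasta(info, width=60):
    """Render the parsed sequence as a FASTA entry headed by the accession."""
    seq = info["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + info["accession"] + "\n" + "\n".join(lines) + "\n"


record = parse_record(SAMPLE_RECORD)
print(record["accession"], record["division"], record["features"])
print(to_fasta(record), end="")
```

The FASTA output keeps only the identifier and the bases, which shows why the flat file, not FASTA, is the format that carries the annotations.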

Data submission and retrieval

Researchers submit sequences to GenBank through web-based tools such as BankIt and the NCBI Submission Portal. Complex annotated submissions were historically prepared with the Sequin desktop software, which NCBI has since retired in favor of command-line tools such as table2asn. Once submitted, records are processed by NCBI staff, who assign accession numbers and perform basic validation before release; major sequencing centers often submit data via high-throughput pipelines. For retrieval, scientists primarily use the Entrez system, which supports queries on fields such as organism, gene name, or author. Sequence similarity searches are run with BLAST, and programmatic access is available through the Entrez Programming Utilities (E-utilities), which are critical for large-scale data analysis projects in computational biology.
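As a minimal sketch of programmatic retrieval, the Python snippet below builds E-utilities request URLs for a fielded search (`esearch`) and for downloading records in flat-file format (`efetch`). The helper names `build_search_url` and `build_fetch_url` and the accessions in the usage lines are hypothetical; the snippet only constructs URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(organism, gene=None, retmax=20):
    """Build an esearch URL querying the nucleotide database by field tags."""
    terms = [f"{organism}[Organism]"]
    if gene:
        terms.append(f"{gene}[Gene Name]")
    params = {
        "db": "nuccore",               # Entrez nucleotide database
        "term": " AND ".join(terms),   # fielded Entrez query
        "retmax": retmax,              # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_fetch_url(ids, rettype="gb"):
    """Build an efetch URL; rettype 'gb' requests flat files, 'fasta' requests FASTA."""
    params = {
        "db": "nuccore",
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical usage: the gene name and accessions are placeholders.
print(build_search_url("Oryza sativa", gene="rbcL"))
print(build_fetch_url(["XX000001", "XX000002"], rettype="fasta"))
```

The resulting URLs could be passed to `urllib.request.urlopen` or any HTTP client. NCBI publishes usage guidelines for E-utilities, including request-rate limits, which scripts issuing many requests should follow.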

Relationship to other databases

GenBank is a cornerstone of the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing partnership with the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and the DNA Data Bank of Japan (DDBJ). Data submitted to any one of the three partners is exchanged daily and mirrored across all of them, so that together they form a single, unified global resource. Within the NCBI ecosystem, GenBank is tightly linked to derived and curated databases such as RefSeq, the Protein and Structure databases, and the Gene Expression Omnibus (GEO). These connections allow researchers to move from a DNA sequence to related protein structures, gene expression profiles, and literature citations in PubMed.

Impact and use in research

The availability of GenBank has transformed biological research, underpinning discoveries in fields ranging from medicine and agriculture to conservation biology and forensic science. It is a primary data source for tools like BLAST, which is used heavily to identify homologous sequences and infer function, a routine step in characterizing genes from pathogens such as SARS-CoV-2 during the COVID-19 pandemic. The database has been fundamental to large-scale comparative genomics projects, such as those tracing the evolution of H5N1 influenza virus or identifying genetic markers in crops like Oryza sativa (rice). Its role as an open-access repository reflects the same commitment to rapid, open data sharing expressed in the Bermuda Principles, and it continues to accelerate scientific discovery worldwide.
