GenBank — LLMpedia

GenBank
Name	GenBank
Type	Biological sequence database
Owner	National Center for Biotechnology Information
Country	United States
Established	1982

Contents

History
Organization and Structure
Data Submission and Accessioning
Sequence Formats and File Types
Data Access and Tools
Quality Control and Annotation
Impact and Applications

GenBank is a comprehensive public repository of nucleotide sequences and associated metadata. It serves as a central resource for researchers working with DNA and RNA sequence data, supporting comparative genomics, phylogenetics, and biotechnology research. It is maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine at the National Institutes of Health, and exchanges data with international partners to keep sequence archives synchronized.

History

GenBank traces its origins to early nucleotide sequencing projects at institutions such as Cold Spring Harbor Laboratory and the Salk Institute, and to collaborations involving researchers affiliated with the National Institutes of Health, the National Library of Medicine, and academic groups from Harvard University and the Massachusetts Institute of Technology. The project expanded through partnerships with the sequence databases of the European Molecular Biology Laboratory and the DNA Data Bank of Japan, forming an international sequence exchange network. Landmark initiatives, including the Human Genome Project and the International HapMap Project, along with work at laboratories such as the Broad Institute, drove growth in submissions and prompted new policies. Key policy milestones involved coordinating public access standards, influenced by panels convened by bodies such as the National Research Council and by policy makers associated with the Office of Science and Technology Policy.

Organization and Structure

GenBank is operated by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine at the National Institutes of Health. Its organizational model includes teams responsible for database curation, software engineering, and user support, and it works with counterpart organizations at the European Bioinformatics Institute (EMBL-EBI) and the DNA Data Bank of Japan (DDBJ). Governance involves advisory groups drawing members from universities such as Stanford University, the University of Cambridge, and the University of California, Berkeley, and from research organizations including the Wellcome Sanger Institute and the Max Planck Society. Funding comes from federal appropriations, cooperative agreements with agencies such as the National Science Foundation, and collaborative grants with consortia such as ELIXIR.

Data Submission and Accessioning

Submitters from research institutions, including the University of Oxford, Yale University, and Johns Hopkins University, and from industrial laboratories at companies such as Pfizer and Illumina, deposit sequence data through submission systems maintained by NCBI. Each submission receives a unique accession identifier, which journals such as Nature, Science, and Cell, and repositories tied to projects such as the 1000 Genomes Project and The Cancer Genome Atlas, use for citation and reproducibility. Policies for human data submissions are coordinated with ethics and regulatory bodies, such as institutional review boards at academic medical centers and data access committees associated with large consortia.
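As an illustration of how an accession identifier can be handled in software, the following minimal sketch splits an identifier into its accession and optional version suffix (for example, `U49845.1`). The function name `parse_accession` and the `Accession` type are hypothetical, and the pattern is an assumption covering only the classic nucleotide layouts (one letter plus five digits, or two letters plus six or eight digits); real accession rules include more prefixes and record types.

```python
import re
from typing import NamedTuple, Optional

# Simplified patterns for classic nucleotide accessions. This is an assumption
# for illustration; it does not cover protein, WGS, or other record types.
_ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.(\d+))?$")


class Accession(NamedTuple):
    accession: str          # stable identifier, e.g. "U49845"
    version: Optional[int]  # version suffix if present, else None


def parse_accession(text: str) -> Accession:
    """Split 'ACCESSION.VERSION' into its parts (hypothetical helper)."""
    match = _ACCESSION_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized nucleotide accession: {text!r}")
    version = int(match.group(2)) if match.group(2) else None
    return Accession(match.group(1), version)
```

Keeping the version separate from the accession lets a citation point either at a record in general or at one specific revision of its sequence.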

Sequence Formats and File Types

Sequence records are stored in archival formats that have evolved alongside standards from organizations such as the International Nucleotide Sequence Database Collaboration and technical committees tied to the World Wide Web Consortium. Common file types used for distribution and interoperability include the flatfile format and exchange formats adopted by software projects developed at the European Molecular Biology Laboratory and by utilities created at the National Center for Biotechnology Information. Tools such as assemblers from the Broad Institute and various annotation pipelines have influenced format conventions, and the documented formats are consumed by platforms hosted at the University of California, Santa Cruz and by the Ensembl project.
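To make the flatfile layout concrete, here is a minimal sketch of a reader that pulls the top-level keyword lines and the sequence out of a single record. The record below is a made-up toy (not a real GenBank entry), and `parse_flatfile` is a hypothetical helper that ignores continuation lines, sub-keywords, and the feature table; production code would normally rely on an established parser instead.

```python
from typing import Dict

# Made-up toy record for illustration only; not a real database entry.
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up toy record for illustration only.
ACCESSION   TOY00001
VERSION     TOY00001.1
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""


def parse_flatfile(text: str) -> Dict[str, str]:
    """Collect top-level keywords and the ORIGIN sequence of one record."""
    fields: Dict[str, str] = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates a record
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Top-level keywords start in the first column; indented
            # continuation lines and sub-keywords are skipped in this sketch.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence_parts)
    return fields
```

Calling `parse_flatfile(TOY_RECORD)` returns a dictionary with keys such as `ACCESSION`, `VERSION`, and `sequence`.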

Data Access and Tools

Data access is provided through web portals, through programmatic interfaces used by developers at companies such as Google and research groups at IBM Research, and through integration with analysis platforms from vendors such as Thermo Fisher Scientific. Search and retrieval services interoperate with comparison tools developed by institutes including the European Bioinformatics Institute and with visualization systems originating from teams at the University of Washington. Public APIs and batch download mechanisms support high-throughput workflows used by consortia such as the Global Alliance for Genomics and Health and by computational groups at Carnegie Mellon University.
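One common form of programmatic access is NCBI's E-utilities web service, which the text above does not name and which is used here as an assumption about what "public APIs" covers. The sketch below only builds an `efetch` request URL for a nucleotide record; `build_efetch_url` is a hypothetical helper, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build a URL requesting one nucleotide record as plain text.

    rettype="gb" asks for the flatfile view; rettype="fasta" asks for FASTA.
    """
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession, optionally with a version suffix
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`. High-throughput workflows should throttle their requests in line with NCBI's published usage guidelines.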

Quality Control and Annotation

Quality control procedures draw on community standards shaped by panels involving representatives from the American Society for Microbiology, the International Society for Computational Biology, and academic laboratories at Massachusetts General Hospital. Annotation workflows combine automated prediction algorithms, developed in collaboration with research groups at the Stanford University School of Medicine, with manual curation by biocurators trained at institutions such as the Smithsonian Institution. Error correction and versioning policies align with practices promoted by the International Nucleotide Sequence Database Collaboration and with stewardship guidance from the National Academy of Sciences.

Impact and Applications

The archive underpins discoveries reported in outlets such as The Lancet, the New England Journal of Medicine, and Proceedings of the National Academy of Sciences of the United States of America, enabling comparative genomics research by teams at the Wellcome Sanger Institute and translational projects at Mayo Clinic. It has been instrumental for surveillance programs run by agencies such as the Centers for Disease Control and Prevention and for international efforts coordinated with the World Health Organization. Applications span agricultural research at the United States Department of Agriculture, biotechnology ventures at Genentech, conservation studies by organizations such as the World Wildlife Fund, and education initiatives at universities including Cornell University and the University of Michigan.

Category:Biological databases