LLMpedia: The first transparent, open encyclopedia generated by LLMs

INSDC

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
INSDC
Name: International Nucleotide Sequence Database Collaboration
Abbreviation: INSDC
Formed: 1988
Headquarters: Mishima; Bethesda, Maryland; Hinxton
Members: National Center for Biotechnology Information, European Molecular Biology Laboratory, DNA Data Bank of Japan

The International Nucleotide Sequence Database Collaboration (INSDC) aggregates and synchronizes nucleotide sequence records among the major public archives, providing shared accessioning in support of global bioinformatics and open data reuse. It offers a coordinated framework for submission, curation, and distribution carried out by partner institutions in North America, Europe, and Asia, and facilitates interoperability with reference projects, clinical initiatives, and biodiversity programs. The collaboration underpins large-scale sequencing efforts and cross-references taxonomic, functional, and publication resources used by researchers worldwide.

Overview

The collaboration is a tripartite alliance of three institutional archives: the US-based National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). Its synchronized holdings include raw reads, assembled genomes, transcriptomes, and annotated features used by projects such as the Human Genome Project, the 1000 Genomes Project, the International Cancer Genome Consortium, and the Global Virome Project. The archives coordinate accession numbers, flat-file formats, and metadata models that enable integration with resources such as Ensembl, UniProt, and the Global Biodiversity Information Facility.
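As an illustration of coordinated accessioning, the sketch below maps a run-level read accession to the partner that issued it, relying on the convention that run accessions begin with SRR (NCBI), ERR (EMBL-EBI), or DRR (DDBJ). The function name `issuing_archive` and the minimum digit count are assumptions made for this example, and the mapping covers run accessions only, not every INSDC accession type.

```python
import re
from typing import Optional

# Illustrative subset: run-level accession prefixes and the partner that issues them.
# Other accession types (studies, samples, assembled sequences) are not covered here.
RUN_PREFIX_TO_ARCHIVE = {
    "SRR": "NCBI",
    "ERR": "EMBL-EBI",
    "DRR": "DDBJ",
}

# Three-letter prefix followed by digits; the six-digit minimum is an assumption
# of this sketch rather than a rule taken from the article.
RUN_ACCESSION = re.compile(r"^([SED]RR)(\d{6,})$")


def issuing_archive(accession: str) -> Optional[str]:
    """Return the partner that issued a run accession, or None if unrecognized."""
    match = RUN_ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return RUN_PREFIX_TO_ARCHIVE[match.group(1)]


if __name__ == "__main__":
    for acc in ("SRR000001", "err1234567", "DRR000001", "XYZ123"):
        print(acc, "->", issuing_archive(acc))
```

Because the partners exchange records, an accession issued by one archive is expected to resolve at the other two as well; the prefix indicates where the record was submitted, not where it can be retrieved.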

History and Development

Origins trace to pre-1980s sequence sharing among laboratories and to early database efforts such as the EMBL Data Library and early GenBank exchanges. Formal collaboration began in the late 1980s to avoid duplication between archives maintained by institutions including those associated with Cold Spring Harbor Laboratory and European Molecular Biology Organization initiatives. Over the following decades, the collaboration adapted to technological changes driven by sequencing platforms from Illumina, Pacific Biosciences, and Oxford Nanopore Technologies, and aligned with community standards emerging from meetings such as Wellcome Trust summits and the International Congress of Genetics. Key milestones include the adoption of shared accessioning, support for high-throughput sequencing submissions from consortia such as the Human Microbiome Project, and international responses to outbreaks such as the COVID-19 pandemic caused by SARS-CoV-2.

Organization and Participating Databases

Operational responsibilities are distributed: the National Center for Biotechnology Information maintains GenBank and services that interoperate with PubMed and RefSeq, the European Bioinformatics Institute operates the European Nucleotide Archive alongside related resources such as the European Genome-phenome Archive, and the DNA Data Bank of Japan provides regional submission support and curation. Each partner maintains complementary resources (submission portals, validation pipelines, and specialized databases) that integrate with other resources including PDB and domain-specific archives like dbSNP and ArrayExpress. Collaborative governance involves liaison with funding programmes such as Horizon 2020 and agencies like NIH.

Data Types and Submission Standards

The archives accept sequence data ranging from raw reads produced by platforms from Illumina and Oxford Nanopore Technologies to assembled genomes and annotated feature tables, such as those used by the Genome Reference Consortium. Submission standards cover metadata elements (sample descriptors, collection dates, and geographic provenance) aligned with checklists such as MIxS, with organism names drawn from NCBI Taxonomy and crosswalks to resources like the Catalogue of Life. Controlled vocabularies and structured qualifiers ensure compatibility with downstream annotation pipelines used by Ensembl and variant resources such as ClinVar.
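A minimal sketch of the kind of check a submission pipeline might apply to sample metadata is shown below. The field names (`sample_name`, `collection_date`, `geo_loc_name`), the set of required fields, and the accepted date shapes are assumptions chosen for illustration; they are not the archives' actual validation rules or a complete checklist.

```python
import re
from typing import Dict, List

# Hypothetical required fields for this sketch; real checklists define many more
# and vary by sample type.
REQUIRED_FIELDS = ("sample_name", "collection_date", "geo_loc_name")

# Accept year, year-month, or full date (YYYY, YYYY-MM, YYYY-MM-DD).
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def validate_sample(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems.append(f"missing required field: {field}")
    date = record.get("collection_date", "").strip()
    if date and not DATE_PATTERN.match(date):
        problems.append(f"collection_date has an unsupported format: {date}")
    return problems


if __name__ == "__main__":
    sample = {
        "sample_name": "example-1",
        "collection_date": "March 2020",
        "geo_loc_name": "Japan: Shizuoka",
    }
    print(validate_sample(sample))
```

Returning a list of problems rather than raising on the first one lets a submitter correct every issue in a record in a single pass.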

Access, Tools, and Data Formats

Data distribution uses formats and services compatible with community tools: flat-file records, XML, FASTA, FASTQ, and BAM/CRAM for alignments. Accessions resolve through search portals like Entrez and programmatic interfaces such as the EBI Search APIs. Visualization and analysis tools link records to browsers including the UCSC Genome Browser and Ensembl, and to workflow systems such as Galaxy and Nextflow-based pipelines. Bulk access mechanisms, cloud mirrors, and FTP/Aspera endpoints facilitate large-scale use by consortia such as the All of Us Research Program and pandemic surveillance initiatives.
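Of the formats listed above, FASTA is the simplest: each record is a header line beginning with `>` followed by one or more lines of sequence. The sketch below reads such records into a dictionary keyed by the first token of the header. The function name `parse_fasta` and the accession-like identifiers in the example are made up for illustration; production pipelines would normally rely on an established library instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records as {identifier: sequence}.

    The identifier is the first whitespace-delimited token after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    records: Dict[str, str] = {}
    current_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Identifiers below are invented placeholders, not real records.
    example = [
        ">XX000001.1 example record",
        "ACGT",
        "TTGA",
        "",
        ">XX000002.1",
        "GGCC",
    ]
    print(parse_fasta(example))
```

The same function accepts an open file handle, since iterating over a text file yields its lines.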

Governance, Policies, and Data Sharing Practices

Policy frameworks prioritize open access while respecting ethical constraints imposed by participant consent and by regulatory regimes, such as those governing data from clinical studies overseen by bodies like the European Medicines Agency or national authorities such as the Food and Drug Administration. Embargo options, controlled-access mechanisms via repositories such as the European Genome-phenome Archive, and data-use agreements enable compliance with privacy rules and with mandates from funders like the Wellcome Trust and the Bill & Melinda Gates Foundation. Standards development occurs through liaison with community organizations including the Genomic Standards Consortium and international committees convened by bodies such as the World Health Organization.

Impact and Use in Research and Public Health

The coordinated archive network accelerates research in genomics, epidemiology, and conservation by providing reference datasets used by projects like the Human Genome Project and the Earth BioGenome Project, and by surveillance programs that track influenza and SARS-CoV-2 lineages. Clinical and public health agencies rely on archived sequences for outbreak investigation, vaccine design, and surveillance partnerships involving the Centers for Disease Control and Prevention and national public-health laboratories. The collaboration’s integration with annotation and protein resources supports translational research in precision medicine initiatives such as the NIH All of Us Research Program and in multinational consortia addressing antimicrobial resistance.

Category:Biological databases