Sequence Read Archive

Name	Sequence Read Archive
Type	biological database
Owner	National Center for Biotechnology Information
Established	2007
Country	United States
Access	public

Contents

Overview
History and Development
Data Content and Organization
Submission and Access Policies
Data Formats, Tools, and Processing
Usage and Impact
Governance and Preservation Strategies

The Sequence Read Archive (SRA) is a public repository for high-throughput sequencing data that integrates raw reads, metadata, and workflows from diverse projects. It supports deposition and retrieval of datasets from large consortia, academic laboratories, and industry partnerships worldwide, enabling reproducible analyses across platforms and studies. The archive interoperates with major bioinformatics resources and contributes to standards used by funders and publishers.

Overview

The archive aggregates primary sequencing data submitted by projects such as the 1000 Genomes Project, the Human Microbiome Project, the ENCODE Project, the Earth Microbiome Project, and the Genome 10K Project. Its records link to related entries in GenBank, RefSeq, the European Nucleotide Archive, and the DNA Data Bank of Japan, and to repositories used by the National Institutes of Health, the Wellcome Sanger Institute, the Broad Institute, the European Molecular Biology Laboratory, and the European Bioinformatics Institute. It stores raw reads from platforms marketed by Illumina, Pacific Biosciences, Oxford Nanopore Technologies, and Roche, as well as from historical instruments developed by Applied Biosystems. The archive supports cross-references to specimen vouchers housed in institutions such as the Smithsonian Institution, the Natural History Museum in London, and the American Museum of Natural History, and to sequence-linked collections like the Barcode of Life Data System. Major stakeholders include funders and publishers such as the National Science Foundation, the Howard Hughes Medical Institute, the Wellcome Trust, Nature Publishing Group, Science, and PLOS.

History and Development

The archive originated in collaborations among the National Center for Biotechnology Information, the European Bioinformatics Institute, and the DNA Data Bank of Japan to address the exponential growth of next-generation sequencing that followed projects like the Human Genome Project and the spread of platforms from Illumina and Roche. Early milestones coincided with the publication of datasets from the 1000 Genomes Project and with initiatives led by the Broad Institute and the Wellcome Sanger Institute. Policies evolved under the influence of agencies including the National Institutes of Health and the European Commission, and of funders such as the Wellcome Trust and the Bill & Melinda Gates Foundation. Key technical transitions tracked the adoption of file standards promoted by consortia with participants from the Genome Analysis Centre, and the involvement of firms like Google and Amazon Web Services, which enabled cloud distribution alongside mirrored archives at centers like the European Nucleotide Archive.

Data Content and Organization

Content categories span whole-genome sequencing, transcriptomics, metagenomics, epigenomics, single-cell sequencing, and targeted resequencing from projects including the ENCODE Project, the GTEx Project, The Cancer Genome Atlas, and the Human Microbiome Project, as well as biodiversity surveys such as Barcode of Life Data System initiatives. Records are structured into hierarchies comparable to submissions at GenBank and the European Nucleotide Archive, with descriptors linking to specimen collections at institutions such as the Smithsonian Institution and the Natural History Museum in London. Metadata fields echo standards devised by groups like the Genomic Standards Consortium and are used by curators at the National Center for Biotechnology Information and the European Bioinformatics Institute, and by research groups at Stanford University, Harvard University, the University of Cambridge, and the Massachusetts Institute of Technology.
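The record hierarchy can be made concrete through accession prefixes. The sketch below is illustrative only: it assumes the commonly seen prefix scheme in which the first letter marks the issuing archive (S, E, or D) and the third letter marks the level (study, sample, experiment, or run). The names `parse_accession` and `Accession` are hypothetical helpers introduced here and are not part of any archive toolkit.

```python
import re
from typing import NamedTuple, Optional

# First letter -> archive that issued the accession (assumed mapping).
ARCHIVES = {"S": "NCBI SRA", "E": "ENA", "D": "DDBJ"}

# Third letter -> level in the record hierarchy (assumed mapping).
LEVELS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}

# For example SRP000001 (study) or SRR000001 (run): prefix plus 6+ digits.
ACCESSION_RE = re.compile(r"([SED])R([PSXR])(\d{6,})")


class Accession(NamedTuple):
    archive: str
    level: str
    number: int


def parse_accession(text: str) -> Optional[Accession]:
    """Return the parsed accession, or None if the text does not match."""
    match = ACCESSION_RE.fullmatch(text.strip().upper())
    if match is None:
        return None
    archive_code, level_code, digits = match.groups()
    return Accession(ARCHIVES[archive_code], LEVELS[level_code], int(digits))
```

A call such as `parse_accession("SRR000001")` yields a run-level record attributed to the NCBI archive, while a string from another identifier scheme yields `None`.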

Submission and Access Policies

Submission pipelines reflect requirements set by funders and publishers such as the National Institutes of Health, the Wellcome Trust, the Howard Hughes Medical Institute, Nature Publishing Group, and PLOS. Data access models balance open sharing with controlled access for human genomic data, using systems analogous to those of the Database of Genotypes and Phenotypes and governance influenced by regulations like the Health Insurance Portability and Accountability Act. Submitter institutions include universities such as the University of Oxford, Yale University, and Columbia University, and corporate research groups at Illumina and Google DeepMind. Policies incorporate consent frameworks developed in dialogue with bodies such as the Global Alliance for Genomics and Health and funders including the Bill & Melinda Gates Foundation.

Data Formats, Tools, and Processing

The archive accepts sequence file formats produced by vendors including Illumina, Pacific Biosciences, and Oxford Nanopore Technologies, and leverages compression and container formats championed by projects at the Broad Institute and the European Bioinformatics Institute and by companies like Amazon Web Services and Google. Common tools in downstream analysis include pipelines and software developed by groups at the Broad Institute (for example, workflows used in the Genome Analysis Toolkit), community projects from Bioconductor, and command-line utilities maintained by contributors from the European Molecular Biology Laboratory and Stanford University. Data indexing and retrieval rely on cloud platforms such as Amazon Web Services and Google Cloud Platform and on compute resources at the National Institutes of Health.
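As a rough illustration of downstream processing, the sketch below reads sequencing reads from FASTQ text and computes a mean base quality. It rests on assumptions the text does not state: that reads have been exported as four-line FASTQ records without line wrapping, and that qualities use the common Phred+33 ASCII encoding. `read_fastq`, `mean_phred`, and `Read` are hypothetical names for this example, not functions of any archive toolkit.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records (no line wrapping assumed)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield Read(header[1:].strip(), sequence, quality)


def mean_phred(read: Read, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding by default."""
    if not read.quality:
        return 0.0
    return sum(ord(ch) - offset for ch in read.quality) / len(read.quality)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real exported file.
    sample = io.StringIO("@r1\nACGT\n+\nIIII\n")
    for read in read_fastq(sample):
        print(read.name, len(read.sequence), mean_phred(read))
```

The generator keeps memory use flat regardless of file size, which matters for read sets that run to many gigabytes; production pipelines would normally rely on an established parser rather than a hand-written one.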

Usage and Impact

Researchers from institutions such as Harvard University, the Massachusetts Institute of Technology, the University of California, Berkeley, and the University of Tokyo use the archive for reanalysis, meta-analysis, and methods development across fields influenced by landmark studies like the 1000 Genomes Project and the ENCODE Project. The archive underpins discoveries published in venues including Nature, Science, Cell, and specialty journals, and supports public health responses coordinated by organizations like the Centers for Disease Control and Prevention, the World Health Organization, and regional agencies. It has facilitated large-scale efforts in pathogen genomics, as seen in responses to outbreaks investigated by teams at the London School of Hygiene & Tropical Medicine and Johns Hopkins University.

Governance and Preservation Strategies

Governance involves coordinating institutions including the National Center for Biotechnology Information, the European Bioinformatics Institute, and the DNA Data Bank of Japan, with funding and policy input from the National Institutes of Health, the Wellcome Trust, the European Commission, and foundations like the Bill & Melinda Gates Foundation. Preservation strategies draw on archival practices at organizations such as the Smithsonian Institution and on technical collaborations with cloud providers including Amazon Web Services and Google Cloud Platform to ensure long-term accessibility, format migration, and mirroring with counterparts like the European Nucleotide Archive.
