NCBI Sequence Read Archive

Name	Sequence Read Archive
Title	NCBI Sequence Read Archive
Producer	National Center for Biotechnology Information
Country	United States
History	2007–present
Discipline	Genomics
Depth	Primary sequencing reads
Formats	SRA, FASTQ, BAM

Contents

Overview
History and development
Data model and content
Submission and accessioning
Access methods and tools
Data formats and standards
Usage and impact

The NCBI Sequence Read Archive (SRA) is a large public repository for raw high-throughput sequencing data, hosted by the National Center for Biotechnology Information. It serves as a central archival resource for sequencing reads generated by projects affiliated with institutions such as the National Institutes of Health, the Wellcome Trust Sanger Institute, the Broad Institute, and the European Bioinformatics Institute. Researchers from organizations including the Max Planck Society, Cold Spring Harbor Laboratory, Howard Hughes Medical Institute, and the Smithsonian Institution routinely deposit data to support the reproducibility of studies published in journals like Nature, Science, and Cell.

Overview

The Sequence Read Archive stores primary sequencing output from platforms developed by companies such as Illumina, Oxford Nanopore Technologies, Pacific Biosciences, and Thermo Fisher Scientific. This enables reuse in meta-analyses, method development, and replication studies by authors from Harvard University, Stanford University, Massachusetts Institute of Technology, University of Oxford, and University of Cambridge. The archive exchanges data with the European Nucleotide Archive, operated by the European Molecular Biology Laboratory's European Bioinformatics Institute, and with the DNA Data Bank of Japan, and it takes account of work by the Global Alliance for Genomics and Health, to facilitate international sharing of data from efforts including the Human Genome Project, ENCODE, the 1000 Genomes Project, and the Human Microbiome Project.

History and development

The archive originated from collaborations among the National Library of Medicine, the National Center for Biotechnology Information, and groups connected to the International Nucleotide Sequence Database Collaboration, which includes EMBL-EBI and DDBJ, following precedents set by projects such as the Human Genome Project and the HapMap Project. Major milestones included the integration of data from consortia like The Cancer Genome Atlas and the Genotype-Tissue Expression project, and software transitions influenced by contributions from institutions such as the Broad Institute, Wellcome Trust Sanger Institute, and Cold Spring Harbor Laboratory. Policy shifts and technical enhancements drew on guidance from the National Institutes of Health, the Office of Science and Technology Policy, and research funders including the Wellcome Trust and the European Commission.

Data model and content

SRA’s data model organizes records by BioProject, BioSample, Experiment, Run, and Submission entities, aligning with metadata schemas developed by the National Center for Biotechnology Information and harmonized with standards employed by the European Molecular Biology Laboratory and DNA Data Bank of Japan. Content spans whole-genome sequencing, RNA-seq, ChIP-seq, single-cell RNA-seq, metagenomics, and targeted sequencing contributed by consortia such as ENCODE, Roadmap Epigenomics Project, International Cancer Genome Consortium, and GTEx. Sample descriptions often reference organismal information from the International Nucleotide Sequence Database Collaboration, strain and specimen data from the Smithsonian Institution and Natural History Museum, and controlled vocabularies influenced by the Gene Ontology Consortium and the Sequence Ontology.
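The entity relationships described above can be pictured with a small Python sketch. It is only an illustration, not the archive's actual schema: the class and field names (`spots`, `library_strategy`, `runs_for_project`) and every accession shown are placeholders invented here, and the Submission entity is left out for brevity. It assumes each Experiment points at one BioProject and one BioSample and owns one or more Runs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    accession: str   # identifier of one sequencing run
    spots: int       # number of reads in the run (illustrative field)


@dataclass
class Experiment:
    accession: str
    bioproject: str          # accession of the project this experiment belongs to
    biosample: str           # accession of the sample that was sequenced
    library_strategy: str    # e.g. "RNA-Seq" or "WGS"
    runs: List[Run] = field(default_factory=list)


def runs_for_project(experiments: List[Experiment], bioproject: str) -> List[str]:
    """Collect the run accessions of every experiment linked to one project."""
    return [
        run.accession
        for exp in experiments
        if exp.bioproject == bioproject
        for run in exp.runs
    ]


# Placeholder records, invented purely for illustration.
experiments = [
    Experiment("SRX0000001", "PRJNA000001", "SAMN00000001", "RNA-Seq",
               [Run("SRR0000001", 1000), Run("SRR0000002", 2000)]),
    Experiment("SRX0000002", "PRJNA000002", "SAMN00000002", "WGS",
               [Run("SRR0000003", 500)]),
]

print(runs_for_project(experiments, "PRJNA000001"))
```

Keeping the project and sample as plain accession strings on the Experiment, rather than nesting everything in a tree, reflects the assumption that one sample may be used by several experiments.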

Submission and accessioning

Submitters from universities like Yale University, Princeton University, and University of California system campuses, and from research institutes such as the Broad Institute and Sanger Institute, use submission tools provided by the National Center for Biotechnology Information to deposit data under BioProject and BioSample identifiers. Accession numbers are assigned in coordination with database partners such as EMBL-EBI and DDBJ so that datasets can be cited unambiguously in publications appearing in journals like Nature Genetics, Genome Research, and PLOS Genetics. Data submissions follow policies influenced by funding agencies including the National Institutes of Health, Wellcome Trust, and European Research Council; compliance and embargo workflows reflect practices from publishers including Springer Nature, Elsevier, and Oxford University Press.
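As a rough illustration of coordinated accessioning, the sketch below classifies read-archive accessions by prefix. It relies on the commonly documented convention, not stated in this article, that the first letter identifies the issuing archive (S for NCBI, E for EMBL-EBI, D for DDBJ) and the third letter identifies the record type. The function name `classify_accession` and the labels it returns are hypothetical.

```python
import re
from typing import Optional, Tuple

# First letter: which partner archive issued the accession (assumed convention).
ARCHIVES = {"S": "NCBI SRA", "E": "EMBL-EBI ENA", "D": "DDBJ"}
# Third letter: which kind of record the accession names (assumed convention).
ENTITIES = {
    "A": "submission",
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
}

_PATTERN = re.compile(r"([SED])R([APSXR])(\d+)")


def classify_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (archive, entity type), or None if the text does not match."""
    match = _PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        return None
    return ARCHIVES[match.group(1)], ENTITIES[match.group(2)]


# Placeholder accessions, used only to exercise the function.
for acc in ["SRR000001", "ERX123456", "DRP000123", "PRJNA12345"]:
    print(acc, classify_accession(acc))
```

BioProject and BioSample identifiers use a different scheme, so the sketch deliberately returns `None` for them instead of guessing.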

Access methods and tools

Users retrieve datasets through web interfaces and through programmatic access via APIs and command-line utilities provided by the National Center for Biotechnology Information, and they also rely on tools developed by collaborators at the European Bioinformatics Institute, the Broad Institute, and the Galaxy Project. Common software that works with SRA content includes SAMtools and BWA, both of which originated at the Wellcome Trust Sanger Institute, the Genome Analysis Toolkit from the Broad Institute, HISAT2 with authors affiliated with Johns Hopkins University, and FastQC, which is used across research groups worldwide. Computational platforms such as Amazon Web Services, Google Cloud Platform, and national computing facilities at Lawrence Berkeley National Laboratory and Oak Ridge National Laboratory enable large-scale reanalysis of SRA datasets.
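A minimal sketch of programmatic access follows. The article does not name specific interfaces, so the choices here are assumptions: the search URL targets the NCBI E-utilities `esearch` endpoint with `db=sra`, and the command lines use the `prefetch` and `fasterq-dump` programs from the SRA Toolkit. The helper names `sra_search_url` and `download_commands` are hypothetical, and the accessions are placeholders. The code only builds the URL and the argument lists; it performs no network request and runs no external program.

```python
from typing import List
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed to be the API of interest here).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sra_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL that queries the SRA database for a term."""
    query = urlencode(
        {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS_ESEARCH}?{query}"


def download_commands(run_accession: str, outdir: str = "fastq") -> List[List[str]]:
    """Return SRA Toolkit command lines to fetch a run and convert it to FASTQ."""
    return [
        ["prefetch", run_accession],
        ["fasterq-dump", run_accession, "--split-files", "--outdir", outdir],
    ]


# Placeholder identifiers, for illustration only.
print(sra_search_url("PRJNA000001"))
for command in download_commands("SRR0000001"):
    print(" ".join(command))
```

Returning the commands as argument lists rather than shell strings means they can be passed directly to `subprocess.run` without shell quoting, should a caller choose to execute them.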

Data formats and standards

Primary file formats include the SRA native format and conversions to FASTQ, BAM, and CRAM, with standards and specifications influenced by the Global Alliance for Genomics and Health, the International Nucleotide Sequence Database Collaboration, and community forums such as the Bioinformatics Open Source Conference. Metadata standards draw on BioSample and BioProject schemas from the National Center for Biotechnology Information and on ontologies developed by the Gene Ontology Consortium and the Open Biological and Biomedical Ontology Foundry. These facilitate interoperability with resources like RefSeq, which is maintained by the National Center for Biotechnology Information, Ensembl, which is maintained by EMBL-EBI, and UniProt, which is maintained by a consortium that includes EMBL-EBI.
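To make the FASTQ conversion target concrete, the following sketch reads FASTQ text and decodes base qualities. It assumes the common layout of four unwrapped lines per record (header starting with `@`, sequence, `+` separator, quality string) and the Phred+33 quality encoding. The function name `read_fastq` and the sample read are illustrative, and the sketch does not handle multi-line records or compressed files.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, Phred scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of input (or a blank line) stops the reader
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Phred+33: each quality character's code point minus 33 is the score.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# A made-up single-read example.
example = io.StringIO("@read1\nACGT\n+\n!I5?\n")
for name, sequence, scores in read_fastq(example):
    print(name, sequence, scores)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the score of 40 decoded from `I` means roughly one expected error in 10,000 base calls.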

Usage and impact

SRA underpins reproducible research by authors at institutions including Harvard Medical School, University of California San Diego, Massachusetts General Hospital, and Karolinska Institutet, and supports large-scale studies such as the 100,000 Genomes Project and the Global Ocean Sampling expedition. Its datasets have enabled discoveries reported in journals like Nature, Science, Cell, and The Lancet, informed public health responses at agencies such as the Centers for Disease Control and Prevention and the World Health Organization, and catalyzed tool development by projects associated with the Broad Institute, EMBL-EBI, and the Galaxy Project. The archive continues to shape genomics research practices in alignment with policies from the National Institutes of Health, Wellcome Trust, and European Commission, and to interoperate with archives managed by EMBL-EBI and DDBJ.

Category:Biological databases