GEO (Gene Expression Omnibus)

Name	Gene Expression Omnibus
Producer	National Center for Biotechnology Information
Country	United States
History	2000–present
Disciplines	Genomics, Transcriptomics
Cost	Free

Contents

Overview
Data Content and Formats
Submission and Curation
Access and Tools
Applications and Impact
Limitations and Criticisms

GEO (Gene Expression Omnibus) is a public functional genomics data repository maintained by the National Center for Biotechnology Information (NCBI). It archives high-throughput gene expression data generated on hybridization arrays and next-generation sequencing platforms, supporting reuse by researchers affiliated with institutions like Harvard University, Stanford University, Massachusetts Institute of Technology, University of Cambridge, and University of Oxford. The resource interoperates with other NCBI resources such as GenBank, RefSeq, and PubMed Central, and with external databases such as UniProt, to enable integrative analyses.

Overview

GEO was launched within the National Institutes of Health ecosystem and is operated by NCBI, a division of the National Library of Medicine, aligning with data-sharing policies from agencies including the National Science Foundation and the European Research Council. It serves projects from consortia such as the ENCODE Project, the 1000 Genomes Project, The Cancer Genome Atlas, the International HapMap Project, and the Human Cell Atlas. Contributors span organizations like the Broad Institute, the Wellcome Sanger Institute, Cold Spring Harbor Laboratory, the European Bioinformatics Institute, and Johns Hopkins University. Journals including Nature, Science, Cell, PLoS ONE, and Genome Research accept GEO deposition as a way to satisfy their data-availability policies.

Data Content and Formats

GEO stores series of records covering microarray experiments from vendors such as Affymetrix, Agilent Technologies, and Illumina, as well as sequencing datasets from platforms by Illumina and Oxford Nanopore Technologies. Data types include expression matrices, raw CEL files, FASTQ reads, BAM alignments, and processed count tables, annotated with metadata that follows the MIAME and MINSEQE reporting guidelines. Metadata fields reference controlled vocabularies and ontologies curated by organizations such as the Open Biological and Biomedical Ontology Foundry, the Gene Ontology Consortium, the Human Phenotype Ontology, and the Sequence Ontology. Data formats are compatible with tools from Bioconductor, Galaxy, the UCSC Genome Browser, Ensembl, and IGV.

Submission and Curation

Submitters register through NCBI accounts and provide experiment descriptions, sample annotations, and series information, following guidelines akin to policies from the National Institutes of Health, the Wellcome Trust, and the European Molecular Biology Laboratory. Curation involves staff at the National Center for Biotechnology Information and automated pipelines that map identifiers to resources like Entrez Gene, RefSeq, UniProt, and the Sequence Read Archive (SRA). Large-scale projects coordinate their submission practices, as exemplified by the GTEx Consortium, the ENCODE Project Consortium, and the 1000 Genomes Project Consortium. Data release timing frequently aligns with journal embargoes from publishers such as Elsevier, Springer Nature, and Wiley-Blackwell.

Access and Tools

Users retrieve data via web interfaces, programmatic access using the Entrez Programming Utilities, and bulk downloads through FTP services. Raw sequencing reads are served by the Sequence Read Archive, whose content is exchanged with partner archives including the European Nucleotide Archive and the DNA Data Bank of Japan. Analytical pipelines integrate with Bioconductor packages like limma, DESeq2, and edgeR as well as platforms such as GenePattern, Cavatica, and Seven Bridges Genomics. Visualization and reanalysis are supported by GEO's own GEO2R tool and by third-party resources such as GEOmetadb, Expression Atlas, and cBioPortal, along with tools developed by teams at the Broad Institute and the European Bioinformatics Institute.
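The programmatic and bulk-download routes described above can be sketched in Python. This is a minimal illustration, not an official client: the function names are hypothetical, the search term is an arbitrary example, and the directory layout (the last three digits of the accession number replaced by "nnn") is the convention commonly seen on the GEO FTP site and should be checked against current NCBI documentation. The sketch only builds URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the Entrez Programming Utilities (E-utilities).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets Entrez database (db=gds).

    The helper name and the default retmax are illustrative assumptions.
    """
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def series_download_dir(accession):
    """Return the bulk-download directory for a GEO series accession.

    Assumes the 'GSE<prefix>nnn/<accession>/' layout, where the last
    three digits of the accession number are replaced by 'nnn'.
    """
    digits = accession[3:]
    if not accession.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"not a series accession: {accession!r}")
    stub = "GSE" + digits[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"


if __name__ == "__main__":
    # Example query: an arbitrary search term, chosen only for illustration.
    print(build_esearch_url("breast cancer AND Homo sapiens[Organism]"))
    print(series_download_dir("GSE1234"))
```

Keeping URL construction separate from any network call makes the logic easy to test offline; a real client would also need to respect NCBI's request-rate guidance.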

Applications and Impact

Researchers use GEO datasets for meta-analyses, biomarker discovery, and validation in studies associated with institutions like Mayo Clinic, Memorial Sloan Kettering Cancer Center, Dana-Farber Cancer Institute, Karolinska Institutet, and Scripps Research. GEO has contributed to advances in oncology, immunology, and neuroscience cited in works from the American Association for the Advancement of Science, the European Society for Medical Oncology, and the Society for Neuroscience. It underpins computational method development by groups at Princeton University, Yale University, University of California, San Francisco, University of California, Berkeley, and California Institute of Technology. Regulatory science and reproducibility initiatives from the Food and Drug Administration and the European Medicines Agency have cited public repositories like GEO as infrastructure for data transparency.

Limitations and Criticisms

Critiques focus on variable metadata quality, inconsistent adherence to MIAME/MINSEQE standards, and incomplete sample annotations from contributors at diverse organizations, including academic labs and industry partners like Roche and Pfizer. Issues such as batch effects, platform heterogeneity, and selective deposition complicate cross-study comparisons and are discussed in journals like Bioinformatics, Nature Methods, and Genome Biology. Computational reproducibility challenges have prompted tooling from third-party developers and community efforts led by groups such as Software Carpentry and Mozilla Science Lab to improve standards. Privacy concerns around human genomic data have engaged stakeholders including institutional review boards (IRBs), the Council for International Organizations of Medical Sciences, and national agencies such as Health Canada.
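One way to make the metadata-quality concern concrete is to measure how often each expected annotation field is actually filled in across the samples of a dataset. The sketch below assumes sample annotations have already been loaded as a list of dictionaries; the function name and the field names ("tissue", "sex") are hypothetical and do not correspond to any GEO-mandated schema.

```python
def annotation_completeness(samples, required_fields):
    """Return, for each required field, the fraction of samples
    that carry a non-empty value for it.

    samples: list of dicts mapping annotation field -> value.
    required_fields: field names chosen by the analyst (an assumption,
    not a GEO requirement).
    """
    if not samples:
        return {field: 0.0 for field in required_fields}

    report = {}
    for field in required_fields:
        filled = 0
        for sample in samples:
            value = sample.get(field)
            # Missing keys, None, and whitespace-only strings all count as empty.
            if value is not None and str(value).strip():
                filled += 1
        report[field] = filled / len(samples)
    return report


if __name__ == "__main__":
    # Toy records invented for illustration; not real GEO samples.
    toy_samples = [
        {"tissue": "liver", "sex": ""},
        {"tissue": "kidney", "sex": "F"},
    ]
    print(annotation_completeness(toy_samples, ["tissue", "sex"]))
```

A report like this only flags missing values; it says nothing about whether the values that are present use consistent vocabulary, which is a separate part of the criticism.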

Category:Biological databases