Genome Aggregation Database

Name	Genome Aggregation Database
Abbreviation	gnomAD
Established	2014
Type	Database
Discipline	Human genetics
Country	International

Contents

Introduction
History and Development
Data Sources and Cohorts
Data Processing and Quality Control
Access, Usage, and Licensing
Scientific Impact and Applications
Limitations and Criticisms

The Genome Aggregation Database (gnomAD) is an international resource on human genetic variation that aggregates large-scale sequencing data from multiple consortia and projects and publishes allele frequencies for use in clinical and population genetics. It supports clinical interpretation, population genetics research, and variant filtering through aggregated exome and genome datasets derived from diverse cohorts contributed by research institutions, hospitals, and biobanks. Data from the contributing projects and institutions are harmonized across sequencing platforms, which enables reproducible variant annotation and comparative analyses.

Introduction

The database aggregates variant frequency data drawn from projects such as the 1000 Genomes Project, UK Biobank, the Exome Aggregation Consortium, and The Cancer Genome Atlas, as well as national biobanks including the Estonian Biobank and the All of Us Research Program. It provides allele counts, allele frequencies, genotype counts, and population-specific summaries derived from sequencing performed at centers such as the Broad Institute, the Wellcome Sanger Institute, and the Genome Institute at Washington University in St. Louis. The resource is widely used for variant interpretation alongside annotation tools and reference resources such as Ensembl, RefSeq, dbSNP, ClinVar, and OMIM.
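The relationship between the summary statistics listed above can be illustrated with a short sketch. An allele frequency is the allele count (AC) divided by the allele number (AN), where AN is the number of alleles called at the site, two per diploid genotype. The class name, function name, population labels, and counts below are hypothetical and chosen for illustration only; this is not the database's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PopulationCounts:
    """Hypothetical per-population summary for a single variant."""

    allele_count: int   # AC: observed copies of the alternate allele
    allele_number: int  # AN: total called alleles (2 per diploid genotype)

    @property
    def allele_frequency(self) -> float:
        # A population with no called genotypes has no defined frequency;
        # this sketch reports 0.0 in that case (an assumption).
        if self.allele_number == 0:
            return 0.0
        return self.allele_count / self.allele_number


def summarize(counts: dict[str, PopulationCounts]) -> dict[str, float]:
    """Return per-population frequencies plus a pooled 'total' frequency."""
    freqs = {pop: c.allele_frequency for pop, c in counts.items()}
    pooled = PopulationCounts(
        allele_count=sum(c.allele_count for c in counts.values()),
        allele_number=sum(c.allele_number for c in counts.values()),
    )
    freqs["total"] = pooled.allele_frequency
    return freqs


# Illustrative numbers only; not taken from any real dataset.
example = {
    "pop_a": PopulationCounts(allele_count=5, allele_number=1000),
    "pop_b": PopulationCounts(allele_count=1, allele_number=4000),
}
frequencies = summarize(example)
```

The pooled frequency here is 6/5000, which is lower than the frequency in `pop_a` alone. This is one reason population-specific summaries are reported alongside the overall figure.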

History and Development

Initial efforts grew from collaborations among the groups behind the Exome Variant Server, the 1000 Genomes Project Consortium, and the Exome Aggregation Consortium. These groups sought to meet the need for large control datasets that arose as high-throughput sequencing advanced on platforms developed by Illumina, Thermo Fisher Scientific, and groups at the Broad Institute. Subsequent versions expanded through partnerships with large-scale initiatives including the GenomeAsia 100K Project and Iceland's deCODE genetics, national population efforts such as the FinnGen study, and projects from the National Institutes of Health and the European Molecular Biology Laboratory. Key contributors and leaders have included investigators affiliated with Harvard University, the Massachusetts Institute of Technology, Stanford University, the University of Cambridge, and the University of Oxford.

Data Sources and Cohorts

Aggregated datasets incorporate sequencing from three broad kinds of source: clinical genetics programs at institutions such as Mayo Clinic, Children's Hospital of Philadelphia, and Great Ormond Street Hospital; cancer genomics initiatives such as The Cancer Genome Atlas and the International Cancer Genome Consortium; and population cohorts from biobanks including UK Biobank, the Estonian Biobank, and BioBank Japan, along with regional studies led by teams at Karolinska Institutet and the University of Toronto. Collaborations with companies, consortia, and programs such as deCODE genetics, Genomics England, and the All of Us Research Program, and with research groups at the Wellcome Sanger Institute, contributed exome and whole-genome sequences from diverse ancestries, including African, East Asian, South Asian, Latino, and European samples. Samples from individuals with severe pediatric disease phenotypes are excluded from many of the clinical projects so that the aggregate better represents variation in the general population.

Data Processing and Quality Control

Raw sequencing data from contributors are processed with pipelines developed by teams at the Broad Institute and partners. These pipelines use aligners and variant callers such as BWA and GATK, together with read-processing tools used by groups at the University of California, Santa Cruz. Quality control draws on metrics and standards from organizations including the Global Alliance for Genomics and Health, and on benchmarks against references such as those of the Genome in a Bottle consortium. Variants are annotated with resources such as Variant Effect Predictor, SIFT, PolyPhen-2, and dbNSFP, and filters flag likely artifacts, assess allele balance, and remove low-confidence calls. Population structure and relatedness analyses draw on methods and datasets from studies led by investigators at the University of Michigan and the University of Washington.
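The allele balance check mentioned above can be sketched as follows. For a heterozygous call, allele balance is the fraction of reads supporting the alternate allele, and values far from one half suggest a sequencing or mapping artifact. The function names and every threshold below (minimum depth, minimum genotype quality, and the allele balance window) are illustrative assumptions, not the database's documented settings.

```python
def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("allele balance is undefined with zero reads")
    return alt_reads / total


def passes_het_filter(
    ref_reads: int,
    alt_reads: int,
    genotype_quality: int,
    min_depth: int = 10,   # assumed threshold, for illustration
    min_gq: int = 20,      # assumed threshold, for illustration
    min_ab: float = 0.2,   # assumed lower bound of the balance window
    max_ab: float = 0.8,   # assumed upper bound of the balance window
) -> bool:
    """Return True if a heterozygous call looks high-confidence."""
    depth = ref_reads + alt_reads
    # Reject low-coverage or low-quality genotypes before checking balance.
    if depth < min_depth or genotype_quality < min_gq:
        return False
    return min_ab <= allele_balance(ref_reads, alt_reads) <= max_ab


# Illustrative calls: a balanced het, a skewed het, and a shallow het.
balanced = passes_het_filter(ref_reads=12, alt_reads=10, genotype_quality=50)
skewed = passes_het_filter(ref_reads=29, alt_reads=1, genotype_quality=99)
shallow = passes_het_filter(ref_reads=3, alt_reads=3, genotype_quality=99)
```

In this sketch only the first call passes: the second has an allele balance of 1/30, and the third falls below the assumed minimum depth.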

Access, Usage, and Licensing

Access policies evolved through agreements among contributing institutions and funders, including the Broad Institute, the Wellcome Trust, and national funders such as the National Institutes of Health and the Medical Research Council. Public browser access enables queries and downloads for clinicians and researchers, while detailed genotype-level data often require controlled access through data access committees modeled on the frameworks used by dbGaP and the European Genome-phenome Archive. Licensing and data use limitations reflect participant consent frameworks similar to those negotiated by BioBank Japan and UK Biobank. They also align with privacy and governance guidance from bodies such as the Global Alliance for Genomics and Health and with regulatory frameworks in jurisdictions including the United States and the European Union.

Scientific Impact and Applications

The database has become integral to clinical variant interpretation workflows used by laboratories affiliated with the American College of Medical Genetics and Genomics and the Clinical Genome Resource, and by major clinical genetics services at Mayo Clinic and Massachusetts General Hospital. It underpins population genetics analyses published by groups at Harvard University, Stanford University, and the University of California, Berkeley. It informs studies of selection and demographic history that draw on comparative data from the 1000 Genomes Project and the Human Genome Diversity Project, and it supports rare disease gene discovery in collaborations such as Deciphering Developmental Disorders and consortia such as Epi4K. The resource has been cited in disease-specific research spanning oncology with The Cancer Genome Atlas, cardiology with datasets from the Framingham Heart Study, and neurogenetics linked to work at the Broad Institute and Cold Spring Harbor Laboratory.
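A common way such a resource is used in interpretation workflows is as a frequency filter: a variant that is common in any reference population is unlikely to cause a rare, severe disorder. The sketch below illustrates the idea under stated assumptions. The function name, the variant identifiers, the population labels, and the 0.1% cutoff are all hypothetical, and real workflows choose the cutoff based on disease prevalence and inheritance pattern.

```python
def filter_rare(
    candidates: dict[str, dict[str, float]],
    max_af: float = 0.001,  # assumed cutoff, for illustration only
) -> list[str]:
    """Keep variants whose highest population frequency is at most max_af.

    `candidates` maps a variant identifier to its per-population allele
    frequencies. A variant with no reference entries (an empty mapping)
    is treated as unobserved and therefore kept.
    """
    kept = []
    for variant_id, pop_freqs in candidates.items():
        highest = max(pop_freqs.values(), default=0.0)
        if highest <= max_af:
            kept.append(variant_id)
    return kept


# Illustrative inputs only; identifiers and frequencies are made up.
candidates = {
    "variant_1": {"pop_a": 0.0002, "pop_b": 0.0},
    "variant_2": {"pop_a": 0.0001, "pop_b": 0.03},
    "variant_3": {},
}
rare = filter_rare(candidates)
```

Filtering on the highest population frequency rather than the pooled frequency avoids retaining a variant such as `variant_2`, which is rare overall but common in one population.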

Limitations and Criticisms

Critiques highlight imbalances in population sampling that persist despite the inclusion of cohorts such as BioBank Japan and FinnGen, echoing concerns raised in discussions involving the Human Genome Diversity Project and by advocates at institutions such as the Wellcome Sanger Institute and Karolinska Institutet. Remaining technical limitations include the integration of data from heterogeneous sequencing platforms from vendors such as Illumina and batch effects noted by analysts at the Broad Institute. Calling structural variation is a further challenge, and groups at the Genome Institute at Washington University and deCODE genetics emphasize the need for long-read data from technologies by Pacific Biosciences and Oxford Nanopore Technologies. Ethical debates concern consent frameworks and benefit-sharing, as discussed by policy bodies including the Global Alliance for Genomics and Health and funders such as the National Institutes of Health.
