GWAS Catalog — LLMpedia

GWAS Catalog
Name	GWAS Catalog
Type	Biological database
Owner	European Bioinformatics Institute
Country	United Kingdom
Established	2008
Access	Public

Contents

Introduction
History and development
Scope and data content
Curation and quality control
Access and tools
Usage and impact
Limitations and challenges

The GWAS Catalog is a curated, publicly accessible database that aggregates published associations between genetic variants and human traits. It cross-references studies from major journals and consortia and supports integrative analyses across cohorts, biobanks, and research programs. The resource underpins translational research in genetics by linking variant-level findings to biomedical resources and community standards.

Introduction

The Catalog aggregates summary-level association data from genome-wide association studies reported in outlets such as Nature, Science, The Lancet, New England Journal of Medicine, and PLOS Genetics. It cross-references resources maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute and the National Center for Biotechnology Information, including Ensembl, ClinVar, and dbSNP. The project draws on consortia and cohorts such as the UK Biobank, 1000 Genomes Project, International HapMap Project, GIANT, and DIAGRAM to contextualize associations.

History and development

The Catalog was launched in 2008 by the National Human Genome Research Institute and has been developed jointly with the European Bioinformatics Institute since 2010. Its conceptual groundwork drew on projects such as the International HapMap Project and the Human Genome Project. Key milestones include alignment with standards from the Genomic Standards Consortium and integration initiatives involving the Global Alliance for Genomics and Health and the ELIXIR infrastructure. The Catalog evolved alongside large-scale initiatives such as the Wellcome Trust Case Control Consortium and data-sharing policies shaped by funders like the Wellcome Trust and the National Institutes of Health.

Scope and data content

Content spans published single-nucleotide polymorphism associations, effect sizes, p-values, sample descriptors, and mapped genes derived from studies involving cohorts such as the Framingham Heart Study, Women's Health Initiative, Rotterdam Study, and Million Veteran Program. The Catalog links reported variants to reference resources including dbSNP, Ensembl, and RefSeq, and to annotation sources like Gene Ontology and Reactome. It indexes phenotypes using ontologies and controlled vocabularies, such as the Human Phenotype Ontology, developed in collaboration with stakeholders such as the Monarch Initiative.
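The kinds of fields listed above can be pictured as a simple record type. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not the Catalog's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Association:
    """One variant-trait association (hypothetical field names)."""
    variant_id: str                # e.g. a dbSNP rsID
    trait: str                     # ontology-mapped trait label
    p_value: float
    effect_size: Optional[float]   # beta or odds ratio, if reported
    mapped_gene: Optional[str]
    sample_description: str


def index_by_trait(records):
    """Group records by trait, most significant (smallest p-value) first."""
    index = defaultdict(list)
    for record in records:
        index[record.trait].append(record)
    for trait_records in index.values():
        trait_records.sort(key=lambda r: r.p_value)
    return dict(index)


# Placeholder data for illustration; none of these values are real findings.
records = [
    Association("rs_example_1", "trait X", 3e-8, 0.12, "GENE_A", "10,000 individuals"),
    Association("rs_example_2", "trait X", 1e-12, 0.30, "GENE_B", "10,000 individuals"),
    Association("rs_example_3", "trait Y", 4e-9, None, None, "5,000 cases, 5,000 controls"),
]
by_trait = index_by_trait(records)
```

Keeping the record immutable and indexing by trait mirrors the way users typically query such a resource, by phenotype first and variant second.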

Curation and quality control

Curation pipelines combine manual extraction by domain curators with automated validation against bibliographic and identity services such as PubMed, CrossRef, and ORCID, including author affiliations at institutions such as Harvard University, University of Oxford, Stanford University, and Broad Institute. Quality control includes checks for genome build consistency against reference assemblies such as GRCh37 and GRCh38, allele harmonization against dbSNP, and statistical thresholding informed by community consensus from groups including International HapMap Project collaborators and the American Society of Human Genetics. Provenance metadata capture links to publishing venues like BMJ and funding acknowledgments to organizations such as the European Research Council.
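The three quality-control checks named above (build consistency, allele harmonization, and statistical thresholding) can be sketched as follows. This is an illustrative sketch, not the Catalog's pipeline: the function names, the record layout, and the status labels are assumptions, and the default threshold of 5e-8 is the conventional genome-wide significance level rather than a documented Catalog setting.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(allele):
    """Reverse-complement an allele written in A/C/G/T."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def is_palindromic(a1, a2):
    """A/T and C/G pairs look identical on both strands."""
    return complement(a1) == a2.upper()


def harmonize(effect, other, ref, alt):
    """Classify reported (effect, other) alleles against a reference (ref, alt) pair."""
    effect, other, ref, alt = (x.upper() for x in (effect, other, ref, alt))
    if is_palindromic(ref, alt):
        return "ambiguous"            # strand cannot be inferred from alleles alone
    if (effect, other) == (alt, ref):
        return "match"
    if (effect, other) == (ref, alt):
        return "swapped"              # effect direction must be inverted
    flipped = (complement(effect), complement(other))
    if flipped == (alt, ref):
        return "strand_flipped"
    if flipped == (ref, alt):
        return "strand_flipped_and_swapped"
    return "mismatch"


def off_build(records, expected="GRCh38"):
    """Return records whose declared genome build differs from the expected one."""
    return [r for r in records if r.get("build") != expected]


def passes_threshold(p_value, threshold=5e-8):
    """Assumed default: conventional genome-wide significance."""
    return p_value <= threshold
```

For example, `harmonize("C", "T", "A", "G")` classifies the reported alleles as being on the opposite strand, while an A/T reference pair is reported as ambiguous because strand cannot be resolved without allele frequencies.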

Access and tools

Users access data via a web interface and programmatic APIs, with tooling that interoperates with platforms such as Ensembl, UCSC Genome Browser, Galaxy, and workflow systems used at centers like Wellcome Sanger Institute and Johns Hopkins University. Download formats enable integration with analysis tools from teams at Stanford University, Broad Institute, University of Cambridge, and Massachusetts Institute of Technology for downstream pipelines including fine-mapping, polygenic score construction, and Mendelian randomization analyses. Educational and outreach efforts collaborate with organizations like ELIXIR and Global Alliance for Genomics and Health.
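As an illustration of one downstream use mentioned above, polygenic score construction, the sketch below reads a tab-separated table of associations and computes an additive score as the sum of effect size times allele dosage. The column names (`rsid`, `effect_allele`, `beta`, `p_value`), the variant identifiers, and all numeric values are placeholders invented for the example; they do not reflect the Catalog's actual download layout or any real finding.

```python
import csv
import io

# Hypothetical tab-separated extract; real downloads use different columns.
SAMPLE_TSV = (
    "rsid\teffect_allele\tbeta\tp_value\n"
    "rs_example_1\tA\t0.5\t1e-9\n"
    "rs_example_2\tG\t-0.25\t3e-8\n"
    "rs_example_3\tT\t0.125\t2e-6\n"
)


def load_weights(handle, max_p=5e-8):
    """Read per-variant weights, keeping rows at or below the p-value cutoff."""
    weights = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["p_value"]) <= max_p:
            weights[row["rsid"]] = (row["effect_allele"], float(row["beta"]))
    return weights


def polygenic_score(weights, dosages):
    """Additive score: sum of beta * effect-allele dosage over shared variants."""
    return sum(
        beta * dosages[rsid]
        for rsid, (_allele, beta) in weights.items()
        if rsid in dosages
    )


weights = load_weights(io.StringIO(SAMPLE_TSV))
# Dosage = number of copies (0, 1, or 2) of the effect allele an individual carries.
score = polygenic_score(weights, {"rs_example_1": 2, "rs_example_2": 1})
```

A real pipeline would also harmonize effect alleles against the genotype data and account for linkage disequilibrium between variants; both steps are omitted here for brevity.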

Usage and impact

The Catalog underlies secondary analyses in studies by consortia such as GIANT, CARDIoGRAMplusC4D, and the Psychiatric Genomics Consortium, and by institutions including University College London and Harvard Medical School. It informs translational pipelines used by companies and initiatives like Regeneron Pharmaceuticals and 23andMe, and by public health genetics programs at ministries and agencies worldwide. Applications include the identification of drug targets, cross-referenced against approvals by regulators such as the European Medicines Agency and the U.S. Food and Drug Administration, and incorporation into risk prediction models validated in cohorts such as the UK Biobank and All of Us Research Program.

Limitations and challenges

Challenges include representativeness bias, because populations from countries such as the United Kingdom, United States, and Iceland are over-represented through data sources like UK Biobank and deCODE genetics. Variant annotations can also be inconsistent across genome builds maintained by groups like the Genome Reference Consortium, and summary statistics from diverse consortia such as MAGIC and DIAGRAM are difficult to harmonize. Ethical, legal, and social considerations intersect with guidelines on consent and data sharing from organizations like the Council for International Organizations of Medical Sciences and the Global Alliance for Genomics and Health.

Category:Biological databases