GENCODE — LLMpedia

GENCODE
Name	GENCODE
Established	2003
Discipline	Genomics
Country	International
Institution	European Bioinformatics Institute; Wellcome Sanger Institute; National Human Genome Research Institute; University of California, Santa Cruz; EMBL; Broad Institute

Contents

Overview
History and Development
Goals and Scope
Data and Annotation Methods
Releases and Versions
Usage and Impact
Access and Tools

GENCODE

GENCODE is a collaborative project that provides high-quality reference gene annotation for the human and mouse genomes. It combines manual curation with automated annotation to support research communities such as those working with the Human Genome Project, the 1000 Genomes Project, the ENCODE Project, and the International Human Epigenome Consortium. Major partners include the European Bioinformatics Institute, the Wellcome Sanger Institute, the National Human Genome Research Institute, the University of California, Santa Cruz, EMBL, and the Broad Institute.

Overview

GENCODE produces comprehensive gene sets that include protein-coding genes, non-coding RNA genes, pseudogenes, and alternatively spliced transcripts. These gene sets serve as reference annotation for projects such as the Human Genome Project, the ENCODE (Encyclopedia of DNA Elements) Project, the 1000 Genomes Project, the International HapMap Project, and The Cancer Genome Atlas. Its outputs are used by databases and resources such as Ensembl, the University of California, Santa Cruz (UCSC) Genome Browser, RefSeq, and UniProt, by other resources hosted at EMBL-EBI and NCBI, and by model organism resources coordinated with institutions like the Wellcome Sanger Institute and the Broad Institute. The project interfaces with standards bodies and consortia including the Global Alliance for Genomics and Health and the International Cancer Genome Consortium, and with the Mouse Genome Informatics resource.

History and Development

GENCODE was established in 2003 as part of efforts to refine the reference gene annotation produced during the Human Genome Project, and was subsequently shaped by collaborations involving the European Bioinformatics Institute, the Wellcome Sanger Institute, and the National Human Genome Research Institute. It evolved alongside major initiatives such as ENCODE, the 1000 Genomes Project, the International HapMap Project, and the Mammalian Gene Collection, and drew on methodologies developed at the University of California, Santa Cruz, EMBL, and the Broad Institute. Over time GENCODE adopted pipelines and validation strategies influenced by projects like The Cancer Genome Atlas and the Genotype-Tissue Expression (GTEx) Consortium, and by resources such as the European Nucleotide Archive (ENA) and those maintained by NCBI, UniProt, and Ensembl. Leadership and advisory input have come from investigators affiliated with institutions including Harvard University, Stanford University, the Massachusetts Institute of Technology, and the Wellcome Sanger Institute (formerly the Sanger Centre).

Goals and Scope

GENCODE aims to produce a near-complete annotation of human and mouse gene structures to support clinical, evolutionary, and functional genomics research undertaken by groups such as the International Human Epigenome Consortium, the Human Cell Atlas, the Cancer Genome Atlas, and the Human Microbiome Project. It seeks to reconcile automated pipelines developed by Ensembl and NCBI RefSeq with manual curation performed by expert teams associated with EMBL-EBI, the Wellcome Sanger Institute, and university partners including UC Santa Cruz, Yale University, and Johns Hopkins University. The project scope encompasses integration with protein resources like UniProt, variant resources like ClinVar, population resources like gnomAD, and model organism databases such as Mouse Genome Informatics and FlyBase.

Data and Annotation Methods

GENCODE annotation combines several kinds of evidence: RNA-seq, cap analysis of gene expression (CAGE), long-read sequencing (PacBio, Oxford Nanopore), mass spectrometry, and comparative genomics. This evidence is informed by projects including GTEx, ENCODE, the 1000 Genomes Project, the International Mouse Phenotyping Consortium, and the Human Proteome Project. Automated annotation comes from pipelines developed by Ensembl in collaboration with NCBI, while manual curation draws on expertise from EMBL-EBI, the Wellcome Sanger Institute, the Broad Institute, and university groups at Stanford, Harvard Medical School, and UC Santa Cruz. Validation draws on datasets from dbSNP, ClinVar, COSMIC, PeptideAtlas, PRIDE, and UniProt, and integrates functional annotations linked to Gene Ontology and to pathway resources such as Reactome and KEGG.
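
The combination of manual and automated annotation described above can be illustrated with a toy merge. This is a minimal sketch under a stated assumption: manually curated models take precedence, and an automated model is kept only where no manual model overlaps it on the same chromosome and strand. The `Model` type, the `overlaps` and `merge_models` helpers, and the precedence rule itself are hypothetical simplifications, not GENCODE's actual merge procedure.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """A simplified gene model (hypothetical structure for illustration)."""
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    source: str  # "manual" or "automated"


def overlaps(a: Model, b: Model) -> bool:
    """True if two models share at least one base on the same chromosome and strand."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def merge_models(manual: List[Model], automated: List[Model]) -> List[Model]:
    """Keep every manual model; keep an automated model only if no manual model overlaps it.

    Assumption: manual curation always wins. Real merging is more nuanced.
    """
    merged = list(manual)
    for model in automated:
        if not any(overlaps(model, m) for m in manual):
            merged.append(model)
    return sorted(merged, key=lambda m: (m.chrom, m.start, m.end))


# Invented coordinates, not real annotation data.
manual = [Model("chr1", 100, 500, "+", "manual")]
automated = [
    Model("chr1", 400, 900, "+", "automated"),    # overlaps the manual model: dropped
    Model("chr1", 1000, 2000, "+", "automated"),  # no overlap: kept
    Model("chr1", 200, 300, "-", "automated"),    # opposite strand: kept
]
merged = merge_models(manual, automated)
```

The pairwise overlap check is quadratic in the number of models, which is fine for illustration; a production pipeline would typically use an interval index.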

Releases and Versions

GENCODE issues periodic releases that correspond to Ensembl releases and are displayed in the UCSC Genome Browser. Each release is built on a genome assembly from the Genome Reference Consortium: GRCh37 and GRCh38 for human, and GRCm38 and GRCm39 for mouse. Each release is cross-referenced with resources like RefSeq, UniProtKB, Ensembl, the NCBI Genome Data Viewer, and the UCSC Genome Browser, and is cited in large-scale studies from consortia such as ENCODE, GTEx, TCGA, and the 1000 Genomes Project. Releases have been used as the annotation standard in projects hosted by EMBL-EBI, the Wellcome Sanger Institute, and the Broad Institute, and by national and international bodies including NIH and European Commission-funded infrastructures.
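
When comparing annotation across releases, it helps to separate an identifier's stable part from its version. GENCODE gene and transcript identifiers follow the Ensembl convention of a stable ID followed by a dot and a version number, and some releases append a further suffix such as `_PAR_Y`. The sketch below shows one way to split such identifiers and list those whose version changed between two releases. The function names and the sample identifiers are invented for illustration.

```python
import re

# Stable ID, optional ".<version>", optional trailing suffix (e.g. "_PAR_Y").
ID_RE = re.compile(r"^([A-Za-z0-9]+)(?:\.(\d+))?(.*)$")


def split_versioned_id(identifier):
    """Return (stable_id, version) where version is an int or None."""
    match = ID_RE.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    stable, version, _suffix = match.groups()
    return stable, (int(version) if version is not None else None)


def changed_versions(old_ids, new_ids):
    """List stable IDs present in both releases whose version number differs."""
    old = dict(split_versioned_id(i) for i in old_ids)
    new = dict(split_versioned_id(i) for i in new_ids)
    return sorted(s for s in old.keys() & new.keys() if old[s] != new[s])


# Invented identifiers, not taken from any real release.
older_release = ["ENSG00000000001.1", "ENSG00000000002.4"]
newer_release = ["ENSG00000000001.2", "ENSG00000000002.4", "ENSG00000000003.1"]
changed = changed_versions(older_release, newer_release)
```

Joining on the stable ID alone keeps records aligned across releases even when the version suffix has been incremented.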

Usage and Impact

GENCODE annotations underpin analyses in clinical genomics pipelines at institutions like Mayo Clinic, Johns Hopkins Hospital, and the Broad Institute, and are integral to population genomics efforts such as gnomAD, the 1000 Genomes Project, and the International HapMap Project. They inform cancer variant interpretation in projects such as TCGA and ICGC, transcriptomics analyses in GTEx and the Human Cell Atlas, and proteogenomics studies associated with the Human Proteome Organization and its Human Proteome Project. GENCODE outputs are widely cited in publications in Nature, Science, Cell, Genome Research, and PLOS journals, and are used by bioinformatics tool developers at EMBL-EBI, UCSC, Ensembl, and Bioconductor.

Access and Tools

GENCODE data are accessible through Ensembl, the UCSC Genome Browser, NCBI resources, and EMBL-EBI portals, and are integrated into tools and platforms maintained by the Broad Institute, Wellcome Sanger Institute, and university groups at Stanford, Harvard, and UC Santa Cruz. Commonly used software and services that consume GENCODE annotations include Galaxy, Bioconductor packages, BEDTools, SAMtools, IGV, ANNOVAR, VEP, and Skyline, and the data are incorporated into cloud platforms operated by Amazon Web Services, Google Cloud Platform, and national bioinformatics infrastructures. Training materials and community outreach are provided in collaboration with organizations such as ELIXIR, the Global Alliance for Genomics and Health, EMBL-EBI training, and Cold Spring Harbor Laboratory.
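
GENCODE annotation is commonly distributed as GTF files, in which each record has nine tab-separated columns and the last column holds `key "value";` attribute pairs. As a rough illustration of how a downstream tool might consume such a file, the sketch below parses records and counts genes by their `gene_type` attribute. The sample records are invented, and the helper names `parse_gtf_line` and `count_gene_types` are hypothetical, not part of any GENCODE tooling.

```python
import re
from collections import Counter

# Matches quoted attribute pairs such as: gene_id "ENSG00000000001.1";
# Unquoted attributes (e.g. numeric ones) are ignored by this simple pattern.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF record into selected fixed columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    chrom, source, feature, start, end, _score, strand, _frame, attrs = cols
    return {
        "chrom": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        # Repeated keys (such as multiple tags) keep only the last value here.
        "attributes": dict(ATTR_RE.findall(attrs)),
    }


def count_gene_types(lines):
    """Count 'gene' records by gene_type, skipping comment and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = parse_gtf_line(line)
        if record["feature"] == "gene":
            counts[record["attributes"].get("gene_type", "unknown")] += 1
    return counts


def make_record(feature, start, end, attrs):
    """Build an invented GTF line for the demonstration below."""
    return "\t".join(["chr1", "HAVANA", feature, str(start), str(end), ".", "+", ".", attrs])


# Invented example records, not real GENCODE data.
SAMPLE = [
    "##description: illustrative records only",
    make_record("gene", 1000, 5000, 'gene_id "ENSG00000000001.1"; gene_type "protein_coding"; gene_name "GENE_A";'),
    make_record("transcript", 1000, 5000, 'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1"; gene_type "protein_coding";'),
    make_record("gene", 8000, 9000, 'gene_id "ENSG00000000002.1"; gene_type "lncRNA"; gene_name "GENE_B";'),
    make_record("gene", 12000, 12500, 'gene_id "ENSG00000000003.1"; gene_type "processed_pseudogene"; gene_name "GENE_C";'),
]
gene_type_counts = count_gene_types(SAMPLE)
```

For real files, established parsers in packages such as those from Bioconductor handle edge cases that this sketch deliberately ignores.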
