GRCh38 — LLMpedia

GRCh38
Name	GRCh38
Organism	Human (Homo sapiens)
Assembly level	Reference assembly
Released	2013
Center	National Human Genome Research Institute / Genome Reference Consortium
Accession	GCA_000001405.15

Contents

Background and development
Genome assembly structure and features
Improvements over GRCh37
Alternate loci, patches, and decoy sequences
Annotation and usage in research and clinical genomics
Limitations and future updates

GRCh38 (Genome Reference Consortium Human Build 38) is a human genome reference assembly released by the Genome Reference Consortium in December 2013 as the successor to GRCh37. It provides a shared coordinate system used across biomedical research, clinical genomics, and population studies, integrating curated sequence from multiple sequencing centers and public resources. The assembly underpins variant calling, gene annotation, and comparative genomics work performed by groups at major institutions and consortia.

Background and development

The assembly was produced by the Genome Reference Consortium (GRC), whose member institutions at the time of release were the Wellcome Sanger Institute, the Genome Institute at Washington University, the European Bioinformatics Institute (EMBL-EBI), and the National Center for Biotechnology Information (NCBI). Development built on the clone-based sequence of the Human Genome Project and drew on data and resources from other efforts, including the 1000 Genomes Project, the University of California, Santa Cruz, and Ensembl. The GRC curates the assembly in response to issues reported by research and clinical users, and the reference is widely used within data-sharing frameworks such as those of the Global Alliance for Genomics and Health and by laboratories following guidance from the American College of Medical Genetics and Genomics.

Genome assembly structure and features

The assembly represents the 22 autosomes, chromosomes X and Y, and the mitochondrial genome as linear sequences, together with unlocalized and unplaced scaffolds. Its coordinate system is supported by browsers and tools from the European Bioinformatics Institute, the University of California, Santa Cruz, and the Broad Institute. GRCh38 introduced modeled centromeric sequences in place of the large centromere gaps of earlier builds, includes the mitochondrial reference sequence, and provides alternate representations for highly polymorphic loci. The assembly is distributed as FASTA sequence files and AGP files, which describe how each chromosome is built from component sequences and gaps. These files are available from the National Center for Biotechnology Information and can be indexed for aligners such as the Burrows–Wheeler Aligner (BWA) and variant callers such as the Genome Analysis Toolkit (GATK). Annotations are provided by GENCODE and RefSeq, and the assembly serves as a coordinate reference for databases such as dbSNP and, in its later releases, gnomAD, with links to protein resources such as UniProt.
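As an illustration of how an AGP file relates a chromosome to its components and gaps, the following is a minimal sketch of a parser for the nine tab-separated columns of the AGP 2.0 format. The record classes, function names, and the sample lines (object `chrDemo`, components `DEMO0001.1` and `DEMO0002.1`) are invented for this example and are not taken from the GRCh38 distribution.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union

GAP_TYPES = {"N", "U"}  # AGP component types that mark gaps (known / unknown size)


@dataclass
class Component:
    obj: str            # object being assembled, e.g. a chromosome
    obj_beg: int        # 1-based start on the object
    obj_end: int        # 1-based inclusive end on the object
    part_number: int
    component_type: str
    component_id: str   # accession of the contributing sequence
    comp_beg: int
    comp_end: int
    orientation: str


@dataclass
class Gap:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    gap_length: int
    gap_type: str
    linkage: str
    evidence: str


def parse_agp(lines: Iterable[str]) -> Iterator[Union[Component, Gap]]:
    """Yield one record per non-comment AGP line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        f = line.split("\t")
        if len(f) != 9:
            raise ValueError(f"expected 9 columns, got {len(f)}: {line!r}")
        head = (f[0], int(f[1]), int(f[2]), int(f[3]), f[4])
        if f[4] in GAP_TYPES:
            yield Gap(*head, int(f[5]), f[6], f[7], f[8])
        else:
            yield Component(*head, f[5], int(f[6]), int(f[7]), f[8])


def gap_bases(records: Iterable[Union[Component, Gap]]) -> int:
    """Total number of bases represented as gaps."""
    return sum(r.gap_length for r in records if isinstance(r, Gap))


# Invented sample lines, for illustration only.
EXAMPLE = [
    "##agp-version\t2.0",
    "chrDemo\t1\t10000\t1\tN\t10000\ttelomere\tno\tna",
    "chrDemo\t10001\t60000\t2\tF\tDEMO0001.1\t1\t50000\t+",
    "chrDemo\t60001\t60100\t3\tN\t100\tscaffold\tyes\tpaired-ends",
    "chrDemo\t60101\t90100\t4\tF\tDEMO0002.1\t1\t30000\t-",
]

records = list(parse_agp(EXAMPLE))
total_gap = gap_bases(records)
```

Summing gap lengths this way is one simple means of comparing how much of a chromosome is unresolved in different assembly versions; a production parser would also validate coordinates and component types.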

Improvements over GRCh37

Compared with GRCh37, GRCh38 corrected misassembled regions and erroneous bases identified by GRC curators at institutions such as the Wellcome Sanger Institute and reported by users, including problems affecting clinically relevant genes. Improvements included modeled centromere sequences, closure or reduction of many assembly gaps using additional sequence data, a larger set of alternate loci, and removal or relocation of redundant and mislocalized sequences. Because GRCh38 was released in 2013, these changes predate the Telomere-to-Telomere Consortium and the routine use of long reads from Pacific Biosciences and Oxford Nanopore Technologies, so telomeres and large repeat arrays remained incompletely represented. The updates improved read mapping and the accuracy of variant interpretation workflows, but coordinates differ between the two builds, so data must be realigned or converted rather than used interchangeably.

Alternate loci, patches, and decoy sequences

To represent population variation and complex loci, the assembly includes alternate loci scaffolds and is updated through patch releases. Alternate loci provide additional sequence representations of variable and medically relevant regions, such as the major histocompatibility complex, and are anchored to the primary assembly. Patches are of two kinds: fix patches, which correct errors reported by users including clinical groups and diagnostic laboratories, and novel patches, which add new alternate sequence. Neither kind changes the coordinates of the primary chromosomes. Decoy sequences are not part of the GRC assembly itself; they are additional sequences, such as the hs38d1 set, added to analysis-ready versions of the reference to absorb reads that would otherwise map incorrectly, reducing false-positive alignments from aligners such as Bowtie and BWA.
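The distinction between primary sequences, alternate loci, patches, and decoys often surfaces in practice as sequence-name suffixes. The sketch below classifies names according to the UCSC-style conventions commonly seen in GRCh38 analysis sets (`_alt`, `_fix`, `_decoy`, `_random`, `chrUn_`, `HLA-`). This is an assumption about naming in a given reference file, not a rule defined by the assembly; files using other conventions (for example, bare accessions) would need a different mapping. The function names are hypothetical.

```python
import re

# Primary chromosomes: chr1..chr22, chrX, chrY (UCSC-style names assumed).
PRIMARY = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y)$")


def classify_contig(name: str) -> str:
    """Return a coarse category for a reference sequence name."""
    if PRIMARY.match(name):
        return "primary"
    if name in ("chrM", "chrMT"):
        return "mitochondrial"
    # Suffixes are checked before the chrUn_ prefix, because decoys in some
    # analysis sets are named like chrUn_..._decoy.
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    if name.startswith("HLA-"):
        return "hla"
    return "other"


def primary_only(names):
    """Keep only primary chromosomes and the mitochondrial genome."""
    return [n for n in names if classify_contig(n) in ("primary", "mitochondrial")]


# Illustrative names in the assumed style.
names = [
    "chr1",
    "chrM",
    "chr6_GL000250v2_alt",
    "chr1_KI270706v1_random",
    "chrUn_KI270302v1",
    "chrUn_KN707606v1_decoy",
    "HLA-A*01:01:01:01",
]
categories = {n: classify_contig(n) for n in names}
kept = primary_only(names)
```

A filter like `primary_only` is a common, if blunt, way to restrict downstream analysis to the primary assembly when a tool is not alt-aware.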

Annotation and usage in research and clinical genomics

GRCh38 is the coordinate backbone for annotations produced by GENCODE and RefSeq and for data from projects such as ENCODE and the Roadmap Epigenomics Project, enabling transcript models, regulatory element maps, and variation catalogs used in genome-wide association studies by consortia like GIANT and in clinical variant curation recorded in ClinVar. Clinical laboratories following guidelines from the American College of Medical Genetics and Genomics and the Association for Molecular Pathology rely on the assembly for diagnostic pipelines, interpreting variants with resources such as OMIM, curated at Johns Hopkins University, and variant frequency data from gnomAD. Large cohort programs including UK Biobank and the All of Us Research Program use the assembly for imputation, association testing, and integrative analyses with proteomics and metabolomics datasets produced by collaborative centers.
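Because pipelines depend on the assembly as a coordinate backbone, it can be useful to confirm which build a file was produced against. The following is a rough sketch that guesses the build from the `##contig` lines of a VCF header by comparing the declared length of chromosome 1 with the known lengths in GRCh38 (248,956,422 bp) and GRCh37 (249,250,621 bp). The function names are hypothetical, the comma-splitting is a simplification that does not handle quoted values containing commas, and a header without contig lines yields "unknown".

```python
import re

# Chromosome 1 lengths in the two builds.
KNOWN_CHR1_LENGTHS = {
    248956422: "GRCh38",
    249250621: "GRCh37",
}

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def contig_lengths(header_lines):
    """Map contig ID to declared length from VCF ##contig header lines."""
    lengths = {}
    for line in header_lines:
        m = CONTIG_RE.match(line.strip())
        if not m:
            continue
        # Naive key=value split; assumes no commas inside quoted values.
        fields = dict(kv.split("=", 1) for kv in m.group(1).split(",") if "=" in kv)
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths


def guess_build(header_lines):
    """Return 'GRCh38', 'GRCh37', or 'unknown' based on chromosome 1 length."""
    lengths = contig_lengths(header_lines)
    for name in ("chr1", "1"):  # with or without the 'chr' prefix
        if name in lengths:
            return KNOWN_CHR1_LENGTHS.get(lengths[name], "unknown")
    return "unknown"


header_38 = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chr2,length=242193529>",
]
header_37 = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=1,length=249250621>",
]

build_a = guess_build(header_38)
build_b = guess_build(header_37)
build_c = guess_build(["##fileformat=VCFv4.2"])
```

A check of this kind is only a heuristic; an explicit reference or assembly tag in the file metadata, where present, is a more reliable indicator.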

Limitations and future updates

Despite these improvements, the assembly does not fully resolve the highly repetitive regions later characterized by the Telomere-to-Telomere Consortium or the complex structural variation cataloged in dbVar and DGV. Its representation of population diversity remains incomplete compared with initiatives like the Human Pangenome Reference Consortium and the expanded population sampling of the 1000 Genomes Project. Ongoing efforts by groups at the National Human Genome Research Institute, the European Molecular Biology Laboratory, and commercial sequencing developers aim to integrate near-complete, haplotype-resolved assemblies built from long-read platforms developed by Pacific Biosciences and Oxford Nanopore Technologies. Future reference resources are expected to reduce the reference bias highlighted by studies from the Broad Institute and to improve clinical interpretation pipelines used by ClinGen.

Category:Genome assemblies