NCBI RefSeq — LLMpedia

NCBI RefSeq
Name	RefSeq
Producer	National Center for Biotechnology Information
Country	United States
First release	2000
Scope	Reference nucleotide and protein sequences
Access	Public

Contents

Overview
Data content and components
Data curation and quality control
Access and tools
Applications and impact

NCBI RefSeq (the Reference Sequence database) is a curated, non-redundant collection of reference nucleotide and protein sequences produced by the National Center for Biotechnology Information. It serves as a standardizing resource for sequence-based annotation used across projects such as the Human Genome Project, the Genome Reference Consortium, and model organism databases. RefSeq integrates data from sources including GenBank submissions, the National Library of Medicine, the National Institutes of Health, the National Center for Biotechnology Information, and collaborations with subject-matter authorities.

Overview

RefSeq provides a single, integrated set of genomic DNA, transcript (RNA), and protein sequences for major organisms, and is used in projects such as the Human Genome Project, the 1000 Genomes Project, the ENCODE Project, the International HapMap Project, and the Cancer Genome Atlas. It interfaces with resources such as UniProt, Ensembl, FlyBase, WormBase, Saccharomyces Genome Database, and Mouse Genome Informatics to harmonize identifiers and annotations. RefSeq records are used by consortia including the Genome Reference Consortium, the Global Alliance for Genomics and Health, the Wellcome Sanger Institute, and the Broad Institute to support variant interpretation, comparative genomics, and functional genomics. Major users include clinicians at the Centers for Disease Control and Prevention, researchers at the National Institutes of Health, and bioinformaticians at academic centers such as Harvard Medical School and Stanford University School of Medicine.

Data content and components

RefSeq organizes data into genomic, transcript, and protein products representing species from bacteria and archaea to animals and plants featured in projects like the Earth Microbiome Project, the Human Microbiome Project, the Plant Genome Initiative, the Vertebrate Genomes Project, and the 100K Fungal Genomes Project. Components include curated reference genomes used by the Genome Reference Consortium, RefSeqGene sequences aligned with ClinVar and dbSNP records, curated mRNA and non-coding RNA records cross-referenced to Rfam and miRBase, and protein products annotated with links to UniProtKB, Pfam, PROSITE, and InterPro entries. RefSeq records support annotations in resources such as Gene, OMIM, PharmGKB, and GEO, and are integral to pipelines at institutions like the European Bioinformatics Institute and the National Human Genome Research Institute.
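
The split into genomic, transcript, and protein products is reflected in the two-letter prefix of each RefSeq accession. The following is a minimal sketch of that mapping, not an official NCBI utility: the prefix table is a non-exhaustive subset, and the names PREFIX_CLASS and product_class are hypothetical helpers introduced for illustration.

```python
# Illustrative, non-exhaustive subset of RefSeq accession prefixes.
# PREFIX_CLASS and product_class are hypothetical names, not an NCBI API.
PREFIX_CLASS = {
    "NC_": "genomic",     # complete genomic molecules (e.g. chromosomes)
    "NG_": "genomic",     # genomic regions, including RefSeqGene records
    "NT_": "genomic",     # genomic contigs or scaffolds
    "NW_": "genomic",     # genomic contigs or scaffolds
    "NM_": "transcript",  # protein-coding mRNA
    "NR_": "transcript",  # non-coding RNA
    "XM_": "transcript",  # model (predicted) mRNA
    "XR_": "transcript",  # model (predicted) non-coding RNA
    "NP_": "protein",     # protein paired with an NM_ or NC_ record
    "XP_": "protein",     # model (predicted) protein
    "YP_": "protein",     # protein annotated on a genomic molecule
    "WP_": "protein",     # non-redundant protein shared across genomes
}


def product_class(accession: str) -> str:
    """Return 'genomic', 'transcript', 'protein', or 'unknown' for an accession."""
    return PREFIX_CLASS.get(accession.strip()[:3].upper(), "unknown")


if __name__ == "__main__":
    for acc in ("NC_000017.11", "NM_000546.6", "NP_000537.3"):
        print(acc, product_class(acc))
```

Prefixes outside the table fall through to "unknown" rather than raising, so the helper can be applied to mixed identifier lists without special handling.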

Data curation and quality control

Curation combines automated pipelines with manual review by curators collaborating with domain experts from model organism databases such as FlyBase, WormBase, MGI, ZFIN, and TAIR, and with international bodies including the International Nucleotide Sequence Database Collaboration, the Genome Reference Consortium, and the International Nomenclature Committee. Quality control incorporates sequence validation, transcript evidence from projects like GTEx and ENCODE, protein domain validation against Pfam and InterPro, and cross-checks with UniProt and Swiss-Prot annotations. Versioning and accessioning follow policies coordinated with the International Nucleotide Sequence Database Collaboration and link to clinical resources like ClinVar and the Human Gene Mutation Database to ensure traceability for clinical genetics laboratories at institutions such as Mayo Clinic and Johns Hopkins Hospital.
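
RefSeq identifiers take the form accession.version (for example NM_000546.6), and the version number increases when the underlying sequence changes. As a hedged illustration of why this matters for traceability, the sketch below splits an identifier into its two parts. The regular expression is a simplifying assumption about the identifier shape, and RefSeqId and parse_refseq_id are hypothetical names, not part of any NCBI library.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: two uppercase letters, an underscore, an alphanumeric body,
# and an optional ".<version>" suffix. This is a simplification for illustration.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_[A-Z0-9]+)(?:\.(\d+))?$")


class RefSeqId(NamedTuple):
    accession: str          # stable part, e.g. "NM_000546"
    version: Optional[int]  # sequence version, or None if not given


def parse_refseq_id(text: str) -> RefSeqId:
    """Split 'NM_000546.6' into ('NM_000546', 6); raise ValueError if malformed."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style identifier: {text!r}")
    accession, version = match.groups()
    return RefSeqId(accession, int(version) if version is not None else None)


if __name__ == "__main__":
    print(parse_refseq_id("NM_000546.6"))
```

A pipeline that reports variants would typically keep the version rather than discard it, since coordinates given against one version of a transcript are not guaranteed to hold for another.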

Access and tools

RefSeq data are accessible via NCBI resources including Entrez, BLAST, GenBank, FTP releases, and APIs used by developers at the Broad Institute, EMBL-EBI, Illumina, and Oxford Nanopore Technologies. Visualization and analysis integrate with tools like Genome Data Viewer, UCSC Genome Browser, Integrated Genome Browser, and IGV, and workflows in platforms such as Galaxy, Bioconductor, and Nextflow. Programmatic access is supported by E-utilities, NCBI Datasets, and cloud-hosted distributions used in collaborations with Amazon Web Services, Google Cloud, and Microsoft Azure for large-scale projects like the Cancer Genome Atlas and the All of Us Research Program.
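
As one example of programmatic access, an E-utilities EFetch request for a RefSeq record in FASTA format can be expressed as a URL. The sketch below only builds the URL and performs no network request. The function name efetch_fasta_url is hypothetical, and the example accession is illustrative. Production use should follow NCBI's usage guidelines, such as rate limits and API keys, which are not shown here.

```python
from urllib.parse import urlencode

# Base endpoint of the E-utilities EFetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an EFetch URL that returns one record as FASTA text.

    Use db="nuccore" for nucleotide records and db="protein" for proteins.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(efetch_fasta_url("NM_000546.6"))
    print(efetch_fasta_url("NP_000537.3", db="protein"))
```

The resulting URL can be passed to any HTTP client. Bulk downloads are usually better served by the FTP releases or NCBI Datasets mentioned above than by many individual EFetch calls.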

Applications and impact

RefSeq underpins clinical variant annotation pipelines used by laboratories implementing guidelines from the American College of Medical Genetics and Genomics, supports pathogen surveillance efforts at the Centers for Disease Control and Prevention and the World Health Organization, and aids conservation genomics projects associated with the Smithsonian Institution and the Royal Society. It is foundational to comparative studies by groups at the Wellcome Sanger Institute and the Broad Institute, to transcriptomics analyses in the ENCODE Project and GTEx Consortium, and to drug target validation work at pharmaceutical companies such as Pfizer, Merck, and AstraZeneca. RefSeq’s standardized identifiers facilitate data exchange among repositories like UniProt, Ensembl, ClinVar, and dbSNP, amplifying its role across biomedical research, public health response, and translational medicine initiatives at institutions including the National Cancer Institute and the European Molecular Biology Laboratory.

Category:Biological databases