Variant Call Format

Name	Variant Call Format
Extension	.vcf
Type	Text-based bioinformatics format
Introduced	2009
Owner	1000 Genomes Project
Mime	text/vcf
Latest release	4.3

Contents

Introduction
File format specification
Usage and tools
Common fields and annotations
Examples and workflows
Limitations and extensions

Variant Call Format (VCF) is a text file format for storing genetic variant calls and their associated annotations from high-throughput sequencing projects. It was developed to represent single nucleotide variants, insertions, deletions, and structural variants produced by projects such as the 1000 Genomes Project, the International HapMap Project, and clinical sequencing initiatives at institutions like the Broad Institute and the Wellcome Sanger Institute. VCF files are widely used in pipelines built on tools such as the Genome Analysis Toolkit (GATK), on resources from the European Bioinformatics Institute, and on commercial platforms from companies such as Illumina.

Introduction

VCF was devised to provide a compact, extensible representation of variant calls alongside metadata describing the sample set, reference assembly, and annotation conventions. Early adopters included consortia such as the 1000 Genomes Project and follow-on efforts to the Human Genome Project, while implementers and maintainers have included teams from the Broad Institute, the European Bioinformatics Institute, and the Wellcome Sanger Institute. The format is versioned (for example 4.1, 4.2, and 4.3), with the version declared in each file's header, and is commonly processed with tools such as the Genome Analysis Toolkit and bcftools.

File format specification

A VCF file consists of a header section followed by variant records, one per line. Header lines beginning with '##' carry metadata such as the format version, the reference assembly (for example GRCh37 or GRCh38), and the definitions of INFO, FILTER, and FORMAT fields. A single line beginning with '#' then names the tab-separated columns: eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by FORMAT and one column per sample. Each record gives a 1-based position on a reference such as GRCh38 and may carry standardized identifiers from resources such as dbSNP or clinical assertions from ClinVar. Structural variant representation follows conventions set out in the specification and in related standards work by groups including the Global Alliance for Genomics and Health.
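The layout described above can be illustrated with a small, hand-written file and a minimal reader. This is a sketch only: the records, positions, and the sample name `sampleA` are invented for illustration, and the `parse_vcf` helper is hypothetical rather than part of any VCF library. It ignores much of the specification (for example escaping rules and compressed input).

```python
import io

# A tiny, made-up VCF document. The records are illustrative, not real calls.
EXAMPLE_VCF = """\
##fileformat=VCFv4.3
##reference=GRCh38
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA
chr1\t10177\t.\tA\tAC\t50\tPASS\tDP=14\tGT\t0/1
chr1\t10352\t.\tT\tTA,TAA\t9.5\tq10\tDP=8\tGT\t1/2
"""


def parse_vcf(text):
    """Split VCF text into metadata lines, column names, and record dicts."""
    meta, columns, records = [], [], []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Metadata line, stored without the leading '##'.
            meta.append(line[2:])
        elif line.startswith("#"):
            # Column header line, stored without the leading '#'.
            columns = line[1:].split("\t")
        else:
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])      # 1-based position
            record["ALT"] = record["ALT"].split(",")  # one entry per ALT allele
            records.append(record)
    return meta, columns, records


if __name__ == "__main__":
    meta, columns, records = parse_vcf(EXAMPLE_VCF)
    for record in records:
        print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

For real data, an established parser is usually a better choice than hand-written splitting, since it validates fields against the header definitions.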

Usage and tools

VCF files are produced and consumed by variant callers and annotators such as the Genome Analysis Toolkit, FreeBayes, SAMtools, and Platypus. Downstream annotation and filtering commonly draw on resources and software from the Ensembl project and the UCSC Genome Browser, and on commercial annotation suites from companies such as Illumina and Thermo Fisher Scientific. Files are manipulated and queried with utilities such as bcftools and vcftools, and these steps are often orchestrated by workflow systems such as Nextflow and Snakemake. Clinical pipelines additionally handle VCF data in line with requirements from regulators such as the Food and Drug Administration and from certification bodies in regional health systems.

Common fields and annotations

Standard INFO fields convey site-level information such as allele frequency, functional consequence, and quality metrics, often populated by annotation against databases and ontologies such as dbSNP, gnomAD, ClinVar, and the Sequence Ontology. FORMAT fields describe per-sample values, including the genotype (GT), genotype quality (GQ), and read depth (DP), as produced by callers such as GATK and FreeBayes. The FILTER field records variant-level quality assessments, with criteria that may be defined by consortia such as the Global Alliance for Genomics and Health or by institutional policies at centers such as the Broad Institute and the Wellcome Sanger Institute. Annotation pipelines commonly incorporate effect predictors such as SnpEff and the Variant Effect Predictor (VEP), along with curated knowledgebases such as UniProt.
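To make the INFO and FORMAT conventions concrete, the following sketch decodes one INFO string and one sample column. The helper names `parse_info` and `parse_sample` are hypothetical, the input values are invented, and all values are left as strings; a fuller implementation would convert them using the Type declared in the header.

```python
def parse_info(info):
    """Decode an INFO column: 'key=value' pairs and bare flags, ';'-separated."""
    result = {}
    if info == ".":  # '.' denotes a missing INFO column
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # A key with no '=' is a flag, recorded here as True.
        result[key] = value if sep else True
    return result


def parse_sample(format_field, sample_field):
    """Pair the ':'-separated FORMAT keys with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so omitted trailing values are simply absent.
    return dict(zip(keys, values))


if __name__ == "__main__":
    print(parse_info("DP=14;AF=0.5;DB"))
    print(parse_sample("GT:GQ:DP", "0/1:48:14"))
```

Keeping the two decoders separate mirrors the format itself: INFO describes the site as a whole, while FORMAT keys are shared by every sample column on the line.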

Examples and workflows

Typical workflows begin with read alignment by mappers such as BWA or Bowtie 2, proceed to variant calling with tools such as GATK HaplotypeCaller or FreeBayes, and produce VCF output that is then filtered with vcftools or bcftools and annotated with VEP or SnpEff. Large-scale projects such as the 1000 Genomes Project, the Exome Aggregation Consortium, and clinical sequencing programs at centers like the Broad Institute demonstrate production workflows that include joint genotyping, VCF normalization with utilities such as vt, and annotation with population frequencies from resources such as dbSNP and gnomAD. Visualization and interpretation rely on browsers and portals such as the UCSC Genome Browser and Ensembl, and on clinical platforms used by hospitals and diagnostic laboratories.
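The filtering step in such a workflow can be sketched in a few lines. This is an illustration of the idea only, not a reproduction of how vcftools or bcftools behave: the function name `filter_vcf_lines`, the default threshold of 30, and the choice to keep records with a missing QUAL are all assumptions made for the example.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Yield header lines unchanged, plus records that pass a simple filter.

    A record is kept when FILTER is 'PASS' or '.' and QUAL is either
    missing ('.') or at least min_qual. These rules are assumptions for
    this sketch, not a standard.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual, filter_status = fields[5], fields[6]
        if filter_status not in ("PASS", "."):
            continue
        if qual != "." and float(qual) < min_qual:
            continue
        yield line


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.3\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
        "chr1\t200\t.\tC\tT\t12\tPASS\t.\n",
    ]
    for kept in filter_vcf_lines(example):
        print(kept, end="")
```

Because the function is a generator over lines, it can be applied to an open file handle without loading the whole file into memory.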

Limitations and extensions

VCF was designed primarily for small variants and is limited in how it represents complex haplotypes, multi-allelic sites, and nested structural variants. These challenges have led to extensions and complementary formats from initiatives such as the Global Alliance for Genomics and Health, and to support in tools such as bcftools for conventions such as phased genotypes. For large structural variants and pangenome representations, projects including the Telomere-to-Telomere Consortium and the Human Pangenome Reference Consortium are exploring graph-based formats and annotations, while repositories such as the European Bioinformatics Institute and NCBI coordinate versioning and accessioning practices. Use in clinical reporting requires alignment with regulatory frameworks overseen by agencies such as the Food and Drug Administration and national health authorities.
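Part of the difficulty with multi-allelic sites and phasing is visible in the GT field itself, where each number indexes into the REF and ALT alleles and the separator indicates phasing ('|' for phased, '/' for unphased). The sketch below decodes a genotype under those conventions. The function name `decode_genotype` is hypothetical, and the sketch treats a genotype as phased if any separator is '|', which simplifies mixed cases.

```python
def decode_genotype(gt, ref, alts):
    """Translate a GT string such as '0/1' or '1|2' into allele sequences.

    Returns (calls, phased), where calls holds one allele string per
    haplotype, or None for a missing call ('.').
    """
    phased = "|" in gt
    alleles = [ref] + list(alts)  # index 0 is REF, 1..n are the ALT alleles
    calls = []
    for index in gt.replace("|", "/").split("/"):
        calls.append(None if index == "." else alleles[int(index)])
    return calls, phased


if __name__ == "__main__":
    # A multi-allelic site: both haplotypes carry different ALT alleles.
    print(decode_genotype("1|2", "T", ["TA", "TAA"]))
    print(decode_genotype("0/1", "A", ["AC"]))
```

Note that the same GT string means different things at different sites, because the indices are only meaningful relative to that record's ALT list. This is one reason multi-allelic records are often split or normalized before comparison.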
