GFF3 — LLMpedia

GFF3
Name	GFF3
Extension	.gff, .gff3
Genre	Bioinformatics file format
Released	2001
Latest	3 (formalized)
Owner	Open Bioinformatics community

Contents

Overview
Format Specification
Features and Conventions
Tools and Implementations
Example Files
Adoption and Use Cases
Limitations and Criticisms

GFF3 (Generic Feature Format version 3) is a plain-text bioinformatics file format used for describing features and annotations on biological sequences. It is widely employed by projects such as the Human Genome Project, Ensembl, UCSC Genome Browser, and NCBI, and by research institutes including Wellcome Trust Sanger Institute, European Bioinformatics Institute, Broad Institute, and J. Craig Venter Institute. The format is supported by organizations such as the Open Bioinformatics Foundation and Genome Reference Consortium, and it is used alongside tools such as BLAST, SAMtools, and BEDTools and file formats such as MAF and VCF.

Overview

GFF3 is a tab-delimited format for recording annotations such as genes, exons, CDS, and regulatory elements on reference sequences, including those produced by consortia like 1000 Genomes Project and initiatives such as ENCODE Project. The format complements sequence repositories including GenBank, RefSeq, and DDBJ and can be loaded into visualization platforms such as IGV, Artemis, and JBrowse. Because it is plain text, it is easy to process in the scripting environments and languages common in computational biology, such as Python, Perl, R, and Java.

Format Specification

GFF3 is line-oriented and tab-delimited: each feature line contains nine fields, in this order: seqid, source, type, start, end, score, strand, phase, and attributes. A field with no value is written as a single period. The attributes field holds semicolon-separated tag=value pairs, and reserved tags such as ID and Parent express the relationships between features that underlie the hierarchical gene models used by genome projects such as Ensembl Genomes and WGS initiatives. Lines beginning with a hash (#) are comments, and lines beginning with two hashes (##) are directives; the ##gff-version directive, which must be the first line of the file, declares the format version. Coordinates are 1-based and inclusive, with start less than or equal to end, matching the convention used by databases like RefSeq. This differs from the BED format developed for the UCSC Genome Browser, which uses 0-based, half-open coordinates.
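The nine-column layout described above can be sketched as a small parser. This is a minimal illustration using only the Python standard library, not a validating implementation; the `Feature` class, the function names, and the sample record (`chrDemo`, `gene0001`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import unquote


@dataclass
class Feature:
    """One GFF3 feature line (hypothetical container for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int                 # 1-based, inclusive
    end: int                   # 1-based, inclusive
    score: Optional[float]     # None when the column is "."
    strand: str
    phase: Optional[int]       # None when the column is "."
    attributes: Dict[str, List[str]]


def parse_attributes(field: str) -> Dict[str, List[str]]:
    """Split 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attributes: Dict[str, List[str]] = {}
    if field in ("", "."):
        return attributes
    for pair in field.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Multiple values are comma-separated; special characters are percent-encoded.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_line(line: str) -> Optional[Feature]:
    """Return a Feature, or None for blank lines, comments, and directives."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    return Feature(
        seqid=unquote(seqid),
        source=unquote(source),
        type=unquote(ftype),
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=parse_attributes(attrs),
    )


# Invented sample record, for illustration only.
example = "chrDemo\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=Demo%20gene"
feature = parse_line(example)
print(feature.type, feature.start, feature.end, feature.attributes)
```

Attribute values are stored as lists because a single tag such as Parent may carry several comma-separated values.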

Features and Conventions

GFF3 supports hierarchical relationships among features using the ID and Parent attributes: for example, a gene has child mRNA features, and each mRNA has child exon and CDS features. Feature types are drawn from the Sequence Ontology, and comparable models are used in the annotation pipelines of groups like GENCODE and RefSeq. The format permits free-text annotations alongside controlled vocabularies, enabling integration with ontologies such as Sequence Ontology and Gene Ontology and with metadata registries like BioProject and BioSample. Strand is recorded as +, -, a period for unstranded features, or ? when the strand is relevant but unknown. Phase, which is required for CDS features, is 0, 1, or 2 and gives the number of bases to skip at the start of the feature to reach the first base of the next codon.
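As a rough sketch of how the ID and Parent attributes tie features together, the following code groups child features under their parents. It uses only the standard library; the helper names and the records (`gene0001`, `mrna0001`, and so on) are hypothetical, and percent-decoding of attribute values is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List


def attribute_values(attributes: str, key: str) -> List[str]:
    """Return the comma-separated values stored under `key` in column 9."""
    for pair in attributes.split(";"):
        name, _, value = pair.partition("=")
        if name == key and value:
            return value.split(",")
    return []


def build_children(lines: List[str]) -> Dict[str, List[str]]:
    """Map each parent ID to 'type:ID' labels of its direct children."""
    children: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.rstrip("\r\n").split("\t")
        ftype, attributes = columns[2], columns[8]
        ids = attribute_values(attributes, "ID")
        label = f"{ftype}:{ids[0]}" if ids else ftype
        # A feature may list more than one parent.
        for parent in attribute_values(attributes, "Parent"):
            children[parent].append(label)
    return dict(children)


# Invented records, for illustration only.
records = [
    "chrDemo\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene0001",
    "chrDemo\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna0001;Parent=gene0001",
    "chrDemo\texample\texon\t1000\t1500\t.\t+\t.\tID=exon0001;Parent=mrna0001",
    "chrDemo\texample\texon\t3000\t5000\t.\t+\t.\tID=exon0002;Parent=mrna0001",
    "chrDemo\texample\tCDS\t1201\t1500\t.\t+\t0\tID=cds0001;Parent=mrna0001",
]
children = build_children(records)
for parent, kids in children.items():
    print(parent, "->", ", ".join(kids))
```

Note that the exon and CDS records both point at the mRNA, so they are siblings rather than parent and child.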

Tools and Implementations

A broad set of software parses, validates, converts, and visualizes GFF3, including command-line utilities and libraries maintained by the communities around BioPerl, Biopython, BioRuby, BioJava, and the Open Bioinformatics Foundation. Genome browsers such as JBrowse and IGV consume GFF3 for display, while converters transform GFF3 to and from formats like BED, GTF (including GTF2), the EMBL format, and the GenBank flat file, which are used by systems like Apollo and Galaxy. Validators are provided by projects like the Sequence Ontology consortium, and validation scripts are often bundled with annotation pipelines at institutions such as Wellcome Trust Sanger Institute.
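To illustrate what a format converter has to do, here is a minimal sketch that turns one GFF3 feature line into a six-column BED line. It is not the behavior of any particular tool: the function name is hypothetical, the BED score is fixed at 0 as a simplifying assumption, and the sample record is invented. The key step is the coordinate shift from 1-based inclusive (GFF3) to 0-based half-open (BED).

```python
from typing import Optional


def gff3_to_bed6(line: str) -> Optional[str]:
    """Convert one GFF3 feature line to a BED6 line, or None for non-feature lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")

    # Use the ID attribute as the BED name, falling back to the feature type.
    name = ftype
    for pair in attrs.split(";"):
        key, _, value = pair.partition("=")
        if key == "ID" and value:
            name = value
            break

    bed_start = int(start) - 1   # 1-based inclusive -> 0-based
    bed_end = int(end)           # inclusive end equals the half-open end
    bed_strand = strand if strand in ("+", "-") else "."
    # Assumption: the GFF3 score is not carried over; 0 is a placeholder.
    return "\t".join([seqid, str(bed_start), str(bed_end), name, "0", bed_strand])


# Invented record, for illustration only.
bed_line = gff3_to_bed6("chrDemo\texample\texon\t1000\t1200\t.\t-\t.\tID=exon0001")
print(bed_line)
```

The feature length is preserved by the shift: 1200 - 1000 + 1 in GFF3 equals 1200 - 999 in BED.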

Example Files

Typical example GFF3 files appear in training materials from organizations such as the European Bioinformatics Institute (EMBL-EBI) and in workshops by Cold Spring Harbor Laboratory. Educational datasets used in courses at MIT, Stanford University, Harvard University, and UC Berkeley often include small GFF3 snippets illustrating gene models, exon coordinates, and attributes linking to external identifiers like those from UniProt, Ensembl, and NCBI Gene. Public repositories maintained by projects like Ensembl, UCSC Genome Browser, and GENCODE distribute GFF3 exports for model organisms such as Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, and Saccharomyces cerevisiae.
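A small snippet of the kind described above can be generated as follows. Every value here is invented for illustration: the sequence name `chrDemo`, the source `example`, the IDs, and the `ExampleDB:0001` cross-reference do not come from any real dataset. The rows are joined with tab characters explicitly so that the column separators survive copying.

```python
from collections import Counter

# Hypothetical gene model: one gene, one mRNA, two exons, and a CDS split
# across the two exons (both CDS lines share one ID).
ROWS = [
    ("chrDemo", "example", "gene", "1000", "5000", ".", "+", ".",
     "ID=gene0001;Name=demoGene;Dbxref=ExampleDB:0001"),
    ("chrDemo", "example", "mRNA", "1000", "5000", ".", "+", ".",
     "ID=mrna0001;Parent=gene0001"),
    ("chrDemo", "example", "exon", "1000", "1500", ".", "+", ".",
     "ID=exon0001;Parent=mrna0001"),
    ("chrDemo", "example", "exon", "3000", "5000", ".", "+", ".",
     "ID=exon0002;Parent=mrna0001"),
    ("chrDemo", "example", "CDS", "1201", "1500", ".", "+", "0",
     "ID=cds0001;Parent=mrna0001"),
    ("chrDemo", "example", "CDS", "3000", "4700", ".", "+", "0",
     "ID=cds0001;Parent=mrna0001"),
]

# The version directive comes first, then one tab-delimited line per feature.
gff3_text = "##gff-version 3\n" + "\n".join("\t".join(row) for row in ROWS) + "\n"
print(gff3_text)

# Quick sanity check: count the feature types in the generated text.
type_counts = Counter(
    line.split("\t")[2]
    for line in gff3_text.splitlines()
    if line and not line.startswith("#")
)
print(dict(type_counts))
```

The first CDS segment spans 300 bases, a whole number of codons, which is why the second segment also has phase 0.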

Adoption and Use Cases

GFF3 is used in genome annotation pipelines at centers including Broad Institute and Wellcome Trust Sanger Institute, in national initiatives like Genomics England, and in international projects like 1000 Genomes Project. Use cases include deposition of annotation tracks for browsers such as UCSC Genome Browser, exchange of feature sets among collaborators in projects like ENCODE Project and modENCODE, and integration into variant interpretation workflows alongside VCF files used with clinical resources like ClinVar and DECIPHER. It also underpins community annotation platforms like Apollo and supports data exchange in comparative genomics and phylogenetics studies.

Limitations and Criticisms

Critics note that GFF3 lacks the formal schema rigidity of XML-based standards, and that inconsistent use of attributes by different providers (for example between Ensembl and RefSeq) can complicate interoperability. The free-text nature of some attributes and variable adherence to controlled vocabularies have prompted calls for stronger governance from organizations like Global Alliance for Genomics and Health and tighter integration with ontologies such as Sequence Ontology and Gene Ontology. Performance concerns arise when very large annotation sets, such as those from 1000 Genomes Project or whole-genome annotations from Genome Reference Consortium, are processed as flat text without an index or a database backend such as MySQL or PostgreSQL.
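The performance point can be illustrated with a small sketch that loads features into an indexed table and answers a region query without rescanning the text. It uses Python's built-in `sqlite3` module as a stand-in for the database backends named above; the table layout, column names, function names, and sample records are all assumptions made for this example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def load_features(lines: Iterable[str]) -> sqlite3.Connection:
    """Load GFF3 feature lines into an in-memory, indexed SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature ("
        "seqid TEXT, type TEXT, feat_start INTEGER, feat_end INTEGER, attributes TEXT)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        rows.append((cols[0], cols[2], int(cols[3]), int(cols[4]), cols[8]))
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?)", rows)
    # The index lets lookups by sequence and start position avoid a full scan.
    conn.execute("CREATE INDEX feature_region ON feature (seqid, feat_start)")
    conn.commit()
    return conn


def overlapping(conn: sqlite3.Connection, seqid: str,
                start: int, end: int) -> List[Tuple[str, int, int]]:
    """Return (type, start, end) for features overlapping [start, end], 1-based inclusive."""
    cursor = conn.execute(
        "SELECT type, feat_start, feat_end FROM feature "
        "WHERE seqid = ? AND feat_start <= ? AND feat_end >= ? "
        "ORDER BY feat_start, feat_end",
        (seqid, end, start),
    )
    return cursor.fetchall()


# Invented records, for illustration only.
demo_lines = [
    "##gff-version 3",
    "chrDemo\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene0001",
    "chrDemo\texample\texon\t1000\t1500\t.\t+\t.\tID=exon0001;Parent=gene0001",
    "chrDemo\texample\texon\t3000\t5000\t.\t+\t.\tID=exon0002;Parent=gene0001",
]
connection = load_features(demo_lines)
hits = overlapping(connection, "chrDemo", 1400, 1600)
print(hits)
```

A plain B-tree index on the start coordinate only narrows one side of an overlap query, so dedicated interval indexes tend to scale better on very large files; this sketch shows the general idea rather than a tuned design.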
