Sequence Ontology

Name	Sequence Ontology
Abbreviation	SO
Domain	Genomics
Owner	The Sequence Ontology Project
License	OBO Foundry-compatible
Introduced	2005

Contents

History
Scope and structure
Key terms and definitions
Applications and use cases
Development and maintenance
Adoption and integration with bioinformatics standards

The Sequence Ontology (SO) is an ontology for annotating features of, and variation in, nucleotide sequences. It provides a controlled vocabulary for classifying genomic elements, enabling consistent annotation across projects such as the Human Genome Project, the 1000 Genomes Project, the ENCODE Project, the Genome Reference Consortium, and the International HapMap Project. The ontology supports interoperability among resources including NCBI, the European Bioinformatics Institute, the UCSC Genome Browser, GENCODE, and RefSeq.

History

The ontology originated in response to annotation inconsistencies noted by groups such as the Wellcome Trust Sanger Institute, the Broad Institute, and the National Human Genome Research Institute, and by researchers involved in the Human Genome Project and GENCODE collaborations. Early contributors included curators and informaticians from FlyBase, WormBase, Ensembl, and UniProt, who sought to harmonize annotation across model organism databases and reference assemblies. Subsequent workshops at venues such as Cold Spring Harbor Laboratory and the European Molecular Biology Laboratory, along with meetings of the Open Biomedical Ontologies community, shaped its scope and its alignment with the work of the Gene Ontology consortium and the OBO Foundry. Funding and support came from organizations such as the National Institutes of Health and the Wellcome Trust, and from regional initiatives coordinated with European Commission research programs.

Scope and structure

The ontology covers sequence-level features, including the coding and noncoding elements annotated by projects such as GENCODE, RefSeq, and Ensembl. Its hierarchical structure follows principles set out by the OBO Foundry and integrates with ontologies such as the Gene Ontology, the Protein Ontology, and the Phenotype And Trait Ontology. Terms describe features recorded in databases maintained by NCBI, EBI, the UCSC Genome Browser, and DDBJ, and in model organism resources such as FlyBase, WormBase, MGI, ZFIN, and SGD. Terms are connected by typed relationships (is_a, part_of, derives_from), which makes the ontology usable with tools from the Open Biomedical Ontologies community and with OWL-based editors such as Protégé. The ontology files can be kept under version control on hosting platforms such as GitHub and Bitbucket.
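The typed relationships described above can be illustrated with a minimal Python sketch. The term names are real feature classes, but the parent links here are a deliberately simplified toy subset chosen for illustration; the actual ontology contains intermediate classes and additional relations, and the helper names `ancestors` and `is_subtype` are hypothetical.

```python
from collections import deque

# Toy is_a links (child -> parents). Simplified assumption: the real
# ontology places intermediate classes between several of these terms.
IS_A = {
    "mRNA": ["transcript"],
    "transcript": ["region"],
    "exon": ["region"],
    "intron": ["region"],
    "gene": ["region"],
    "region": ["sequence_feature"],
}

# Toy part_of links (part -> wholes), kept separate from is_a because
# the two relations answer different questions.
PART_OF = {
    "exon": ["transcript"],
}


def ancestors(term, relation=IS_A):
    """Return all terms reachable from `term` by following `relation`,
    in breadth-first order, without duplicates."""
    seen = []
    queue = deque(relation.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.append(parent)
            queue.extend(relation.get(parent, []))
    return seen


def is_subtype(term, candidate):
    """True if `term` is `candidate` or a descendant of it via is_a."""
    return term == candidate or candidate in ancestors(term)
```

In this sketch an exon is part_of a transcript but is not a kind of transcript, so `is_subtype("exon", "transcript")` is false while `is_subtype("mRNA", "transcript")` is true. Keeping the relations in separate tables is one way to stop a query over subtypes from accidentally following containment links.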

Key terms and definitions

Key terms include classes for annotating transcripts, exons, introns, promoters, enhancers, and untranslated regions, as well as the variation types handled by resources such as the 1000 Genomes Project, dbSNP, and ClinVar. The features these terms describe appear in annotations from standards and resources such as RefSeq, GENCODE, HGNC, UniProt, and Ensembl. Terms model sequence features found in genomes studied by consortia such as the Human Genome Project and the International HapMap Project, and in organisms cataloged by FlyBase and WormBase. Definitions align with community vocabularies produced by panels convened at venues such as Cold Spring Harbor Laboratory and at conferences such as the Bioinformatics Open Source Conference.

Applications and use cases

The ontology underpins annotation pipelines at centers such as the Broad Institute, the Wellcome Trust Sanger Institute, and the European Bioinformatics Institute, and at regional genome centers participating in initiatives such as the All of Us Research Program and the Personalized Medicine Coalition. It enables consistent variant interpretation in clinical resources such as ClinVar and supplies the vocabulary used by variant effect predictors from groups including Ensembl and NCBI and from independent projects at universities such as Stanford University, the Massachusetts Institute of Technology, and Harvard University. Integration with visualization platforms such as the UCSC Genome Browser, and with data exchange standards used by consortia such as the Global Alliance for Genomics and Health, facilitates cross-database queries over resources including GenBank, RefSeq, ENA, and model organism databases such as MGI and ZFIN.
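As a rough illustration of how a shared vocabulary makes cross-database queries possible, the following Python sketch maps source-specific feature labels onto common term names before grouping records. The mapping table, the source names, and the function names are hypothetical and exist only for this example; a real pipeline would draw its mappings from the ontology's own synonym data.

```python
# Hypothetical mapping from normalized local labels to shared term names.
LOCAL_TO_SHARED = {
    "5'utr": "five_prime_UTR",
    "5 prime utr": "five_prime_UTR",
    "3'utr": "three_prime_UTR",
    "exon": "exon",
}


def normalize(label):
    """Map a local label to a shared term name, or None if unknown.
    Case and runs of whitespace are ignored."""
    key = " ".join(label.strip().lower().split())
    return LOCAL_TO_SHARED.get(key)


def group_by_shared_term(records):
    """Group (source, label) pairs by shared term.
    Returns (grouped, unmapped) so unknown labels are not silently lost."""
    grouped = {}
    unmapped = []
    for source, label in records:
        term = normalize(label)
        if term is None:
            unmapped.append((source, label))
        else:
            grouped.setdefault(term, []).append(source)
    return grouped, unmapped
```

Given records from two sources that label the same feature as `5'UTR` and `5 prime UTR`, both end up under `five_prime_UTR`, so a single query on that term reaches both. Returning unmapped labels separately is a design choice that surfaces vocabulary gaps for curation instead of dropping them.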

Development and maintenance

Development has been community-driven, with contributors from institutions such as the Wellcome Trust Sanger Institute, the Broad Institute, the European Bioinformatics Institute, and NCBI, and from academic groups at Stanford University and the University of Cambridge. Maintenance follows governance models promoted by the OBO Foundry. Term requests, issue tracking, and ontology releases are coordinated through platforms such as GitHub and through community mailing lists whose participants include GENCODE and curators from FlyBase and WormBase. Workshops at conferences such as ISMB and collaborations with projects such as the Gene Ontology keep the ontology aligned with evolving genomic standards.

Adoption and integration with bioinformatics standards

Adoption spans major repositories and standards bodies: NCBI databases, Ensembl annotation pipelines, UCSC Genome Browser tracks, and clinical resources such as ClinVar and dbSNP. Integration efforts align terms with OBO Foundry principles, map them to ontologies such as the Gene Ontology, the Protein Ontology, and the Phenotype And Trait Ontology, and interoperate with exchange formats endorsed by the Global Alliance for Genomics and Health and with data models used by FAANG and by population initiatives such as the 1000 Genomes Project and the All of Us Research Program. Integration tooling draws on software such as Protégé, BioPerl, Bioconductor, and BEDTools, enabling pipelines implemented at institutions including the Broad Institute, the European Bioinformatics Institute, and university research centers.
