PhastCons

Name	PhastCons
Developer	Adam Siepel and colleagues
Released	2005 (first published description)
Programming language	C, Perl, Python
Operating system	Unix, Linux, macOS, Windows
Genre	Comparative genomics, Sequence conservation
License	Open-source software

Contents

Overview
Methodology
Applications
Performance and Validation
Limitations and Criticisms
Implementation and Software
Related Metrics and Extensions

PhastCons is a computational tool that identifies conserved elements in multiple sequence alignments across species using a phylogenetic hidden Markov model (phylo-HMM). It was developed by Adam Siepel and colleagues in David Haussler's group at the University of California, Santa Cruz, and described in 2005. Its scores have since been used in comparative genomics efforts such as the ENCODE Project and the 1000 Genomes Project, and by groups at institutions including the Broad Institute, the European Bioinformatics Institute, and MIT, making PhastCons a standard tool in comparative sequence analysis. The tool combines phylogenetic models from evolutionary biology, which describe nucleotide substitution along the branches of a tree as a continuous-time Markov process, with a hidden Markov model along the alignment to score nucleotide conservation across vertebrates, mammals, insects, and other clades.

Overview

PhastCons uses a two-state phylogenetic hidden Markov model, with one conserved state and one non-conserved state, to segment alignments into conserved elements. Each state is associated with a phylogenetic model, and the likelihood of an alignment column under each model is computed with the Felsenstein pruning algorithm using a nucleotide substitution model such as the Jukes–Cantor or HKY85 model. The approach builds on likelihood-based phylogenetic models developed by researchers at institutions including Harvard University, the University of California, Berkeley, Stanford University, and the University of Washington. The method has been applied in large-scale analyses, including the conservation annotations of the UCSC Genome Browser and comparative studies across taxa by projects at the Wellcome Sanger Institute and the Max Planck Institute for Evolutionary Anthropology.
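
The column likelihood mentioned above can be illustrated with a minimal sketch of the Felsenstein pruning algorithm under the Jukes–Cantor model. This is not PhastCons code: the tree, its branch lengths, the species names, and the function names are all hypothetical, and the `scale` argument simply multiplies every branch length to mimic a more slowly evolving (conserved) tree.

```python
import math

BASES = "ACGT"

def jc_prob(same, t):
    """Jukes-Cantor transition probability over a branch of length t
    (expected substitutions per site). `same` says whether the base is unchanged."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partials(node, column, scale=1.0):
    """Conditional likelihoods P(observed leaves below node | node has base x).

    A leaf is a species-name string; an internal node is a list of
    (child, branch_length) pairs. Both conventions are assumptions of this sketch.
    """
    if isinstance(node, str):
        obs = column.get(node, "-")
        if obs not in BASES:
            return [1.0] * 4  # gap or missing data: uninformative
        return [1.0 if b == obs else 0.0 for b in BASES]
    like = [1.0] * 4
    for child, t in node:
        child_like = partials(child, column, scale)
        for x in range(4):
            like[x] *= sum(
                jc_prob(x == y, scale * t) * child_like[y] for y in range(4)
            )
    return like

def column_likelihood(tree, column, scale=1.0):
    """Likelihood of one alignment column, with uniform base frequencies at the root."""
    return sum(0.25 * v for v in partials(tree, column, scale))

# Hypothetical tree ((human:0.1, chimp:0.1):0.2, mouse:0.5); branch lengths are made up.
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.5)]

if __name__ == "__main__":
    identical = {"human": "A", "chimp": "A", "mouse": "A"}
    print(column_likelihood(TREE, identical, scale=1.0))  # full-length tree
    print(column_likelihood(TREE, identical, scale=0.3))  # shortened tree
```

An identical column is more probable on the shortened tree than on the full-length tree, which is the signal a conserved state exploits.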

Methodology

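PhastCons requires a phylogenetic model for aligned columns, consisting of a tree with branch lengths and a substitution rate matrix. The parameters can be estimated from the alignment by maximum likelihood or supplied from a previously estimated model, and the tree topology may come from programs such as RAxML, PhyML, MrBayes, or BEAST. The hidden Markov model runs along the alignment, with transition probabilities between the conserved and non-conserved states governing the expected length and coverage of conserved elements, in the manner of Markov-model-based sequence analysis tools such as HMMER (from the Eddy Lab) and Glimmer. The conserved-state model is a scaled version of the neutral model: its branch lengths are the neutral branch lengths multiplied by a scaling factor smaller than one, analogous to the rate multipliers used in PAML. The program outputs a probabilistic basewise conservation score, namely the posterior probability that each column was generated by the conserved state, along with discrete conserved-element intervals that can be compared with annotations from GENCODE and displayed alongside alignments hosted by the UCSC Genome Browser, such as the Multiz alignments.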
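
The basewise score can be sketched as a forward-backward computation on a two-state hidden Markov model. The following is an illustrative sketch rather than the PhastCons implementation: it assumes the per-column likelihoods under the conserved and non-conserved models have already been computed, and the parameter names `mu` (probability of leaving the conserved state) and `nu` (probability of entering it), their default values, and the function name are assumptions made for this example.

```python
def conserved_posteriors(emit_cons, emit_neut, mu=0.1, nu=0.05):
    """Posterior probability of the conserved state at each alignment column.

    emit_cons[i], emit_neut[i]: likelihood of column i under the conserved and
    non-conserved phylogenetic models (assumed positive).
    State 0 = conserved, state 1 = non-conserved.
    """
    n = len(emit_cons)
    if n == 0:
        return []
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]
    # Start from the stationary distribution of the two-state chain.
    init = [nu / (mu + nu), mu / (mu + nu)]
    emit = list(zip(emit_cons, emit_neut))

    # Forward pass, rescaled at every column to avoid numerical underflow.
    fwd, scales = [], []
    for i in range(n):
        if i == 0:
            f = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            f = [
                emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                for s in (0, 1)
            ]
        c = f[0] + f[1]
        fwd.append([f[0] / c, f[1] / c])
        scales.append(c)

    # Backward pass, using the same scaling constants.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        b = [
            sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
            for s in (0, 1)
        ]
        bwd[i] = [b[0] / scales[i + 1], b[1] / scales[i + 1]]

    # With this scaling, fwd * bwd is already the normalized posterior.
    return [fwd[i][0] * bwd[i][0] for i in range(n)]

if __name__ == "__main__":
    cons = [0.01, 0.01, 0.2, 0.2, 0.2, 0.01, 0.01]  # made-up column likelihoods
    neut = [0.05] * 7
    for score in conserved_posteriors(cons, neut):
        print(round(score, 3))
```

Because each posterior depends on neighboring columns through the transition probabilities, the scores vary smoothly along the sequence instead of jumping column by column. Discrete conserved elements would be obtained separately, for example with a Viterbi pass over the same model.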

Applications

PhastCons has been used to annotate conserved noncoding elements in genomes analyzed by the ENCODE Project, to prioritize variants in clinical studies that build on Human Genome Project resources, and to correlate conservation with functional genomics assays from the Roadmap Epigenomics Consortium and the GTEx Project. It supports comparative evolutionary analyses involving species such as Homo sapiens, Mus musculus, Drosophila melanogaster, Danio rerio, and Arabidopsis thaliana, as well as yeast datasets centered on Saccharomyces cerevisiae. Conservation calls inform regulatory genomics studies of loci targeted by CRISPR (clustered regularly interspaced short palindromic repeats) editing, population genetics analyses based on HapMap Project datasets, and conservation-aware motif discovery with tools such as the MEME Suite.

Performance and Validation

PhastCons has been benchmarked against methods including GERP++, PhyloP, and SiPhy, as well as other conservation scoring approaches from the UCSC Genome Browser pipelines. Validation often relies on experimental datasets such as ChIP-seq peaks annotated by consortia like the ENCODE Project and functional assays from laboratories at the National Institutes of Health and the European Molecular Biology Laboratory. Statistical evaluation employs receiver operating characteristic (ROC) curves and precision–recall analyses, as in machine learning comparisons presented at conferences such as NeurIPS and ISMB. Cross-validation studies involving genomes curated by RefSeq and annotations from UniProt have informed parameter choices and strategies for controlling the false discovery rate.
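
As an illustration of the ROC-based evaluation described above, the area under the ROC curve can be computed directly from conservation scores and binary functional labels. This is a minimal sketch with made-up numbers; the function name and the example data are hypothetical and do not come from any published benchmark.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney statistic).

    scores: conservation scores, one per site or element.
    labels: truthy for experimentally supported (positive) sites, falsy otherwise.
    Returns the probability that a random positive outscores a random negative,
    counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Made-up scores: two sites overlapping a functional annotation, two that do not.
    print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The quadratic pairwise loop keeps the sketch short; for genome-scale comparisons a rank-based implementation from an established statistics library would be the more practical choice.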

Limitations and Criticisms

Critiques of PhastCons highlight its sensitivity to alignment quality and phylogeny misspecification and its limited ability to detect rapidly evolving functional elements, echoing concerns raised in the literature by groups at Cold Spring Harbor Laboratory and the Sanger Institute. The model’s binary conserved/non-conserved state can miss gradations of selective pressure of the kind discussed in population genetics theory influenced by Michael Lynch and Motoo Kimura, and it may conflate constraint with compositional bias, as explored in research by Ewan Birney and collaborators. Users should be cautious when applying PhastCons to genomes with incomplete assemblies, such as those from early Genome 10K efforts, or to datasets with alignment artifacts, which are common in repeats annotated by RepeatMasker.

Implementation and Software

PhastCons is distributed as part of the PHAST (Phylogenetic Analysis with Space/Time models) package, which was initiated by Adam Siepel at the University of California, Santa Cruz, and has since been maintained by his group, most recently at Cold Spring Harbor Laboratory, with collaborators at institutions including Riken and the Broad Institute. The core programs are written in C, with scripting interfaces in Perl and Python, and are commonly integrated into pipelines using workflow managers such as Snakemake and Nextflow. The software reads alignments in formats used by UCSC Genome Browser tools (e.g., MAF) and can also be run on alignments produced by programs such as MAFFT, Clustal Omega, and MUSCLE.
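
For readers preparing input, the following sketch shows how the sequence lines of a MAF file can be read into alignment blocks in a pipeline step. It is not part of PHAST: the function names are hypothetical, the example records are made up, and only the `a` (block start) and `s` (sequence) line types are handled.

```python
def parse_maf(lines):
    """Parse MAF text into a list of blocks, each a dict mapping source name to aligned text.

    Only "a" (new block) and "s" (sequence) lines are interpreted; comments and
    other line types are ignored in this sketch.
    """
    blocks = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line terminates the current block.
            if current:
                blocks.append(current)
            current = None
            continue
        if line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "a":
            if current:
                blocks.append(current)
            current = {}
        elif fields[0] == "s" and current is not None and len(fields) >= 7:
            # s <src> <start> <size> <strand> <srcSize> <text>
            current[fields[1]] = fields[6]
    if current:
        blocks.append(current)
    return blocks

def columns(block):
    """Yield each alignment column of a block as a dict of source name to character."""
    names = list(block)
    for chars in zip(*(block[name] for name in names)):
        yield dict(zip(names, chars))

# Made-up example records for illustration only.
EXAMPLE = """##maf version=1
a score=100
s hg38.chr1 100 5 + 1000 ACGTA
s mm10.chr4 200 4 + 2000 ACG-A

a score=50
s hg38.chr1 105 3 + 1000 GGC
"""

if __name__ == "__main__":
    for block in parse_maf(EXAMPLE.splitlines()):
        print(block)
```

Each column produced by `columns` corresponds to one site of the alignment, which is the unit that a phylogenetic model scores.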

Related Metrics and Extensions

Related conservation metrics and extensions include PhyloP for basewise scoring of conservation and acceleration, GERP++ for rejected-substitution estimates, SiPhy for identifying constrained elements from substitution patterns, and integrative scoring frameworks such as CONDEL and FunSeq. Extensions that couple conservation scores with machine learning draw on libraries such as TensorFlow and scikit-learn for integrative regulatory element prediction, as used in studies by groups at Google DeepMind and university bioinformatics cores. Comparative annotation resources such as the UCSC Genome Browser, Ensembl, and GENCODE routinely incorporate PhastCons outputs alongside these related metrics.

Category:Comparative genomics