PolyPhen-2 — LLMpedia

PolyPhen-2
Name	PolyPhen-2
Developer	Ivan Adzhubei, Shamil Sunyaev, and collaborators; Harvard Medical School
Released	2010
Latest release	2010
Programming language	C++; Python
License	Academic

Contents

Introduction
Algorithm and Features
Input, Output, and Interpretation
Performance and Validation
Applications and Use Cases
Limitations and Criticisms

PolyPhen-2 is a computational tool for predicting the potential impact of amino acid substitutions on protein structure and function. It integrates sequence-based and structure-based features to classify missense variants as benign, possibly damaging, or probably damaging. PolyPhen-2 has been widely used in human genetics, clinical genomics, and evolutionary biology studies conducted by groups at institutions such as Stanford University, Massachusetts Institute of Technology, Broad Institute, Harvard University, and University of Cambridge.

Introduction

PolyPhen-2 was developed to address the variant interpretation challenges raised by large-scale sequencing efforts such as the Human Genome Project and the 1000 Genomes Project. It has since been applied to data from later resources, including the Exome Aggregation Consortium and ClinVar, and in clinical sequencing initiatives at hospitals such as Mayo Clinic and Johns Hopkins Hospital. Drawing on comparative approaches used in resources like UniProt, Pfam, and the Protein Data Bank, the tool uses sequence homology and structural information, building on methods from groups at the European Bioinformatics Institute and the National Center for Biotechnology Information. PolyPhen-2's outputs are commonly reported in annotation pipelines alongside predictions from SIFT, CADD, MutationTaster, and PROVEAN, and alongside data from Ensembl, the UCSC Genome Browser, and dbSNP.

Algorithm and Features

The PolyPhen-2 algorithm combines features derived from amino acid physicochemical properties, multiple sequence alignments built from databases such as UniProtKB, and three-dimensional structural parameters from sources like the Protein Data Bank and homology models associated with projects at the Swiss Institute of Bioinformatics. From the alignments it computes position-specific independent counts (PSIC) profile scores, and it maps substitutions onto structural motifs described in work by researchers at the European Molecular Biology Laboratory and Cold Spring Harbor Laboratory. The features are combined by a naive Bayes classifier, a probabilistic approach also found in studies from Stanford University School of Medicine and Massachusetts General Hospital. Feature sets include conservation metrics comparable to those used by PhyloP and GERP++, solvent accessibility measures of the kind computed in Rosetta analyses, and residue contact assessments similar to those used in publications from the University of California, San Francisco.
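The way a naive Bayes classifier turns several independent features into a single probability can be illustrated with a minimal sketch. The feature names, their discrete values, and the likelihood tables below are hypothetical and chosen only for illustration; they are not PolyPhen-2's actual features or trained parameters.

```python
import math


def naive_bayes_posterior(features, likelihoods, prior_damaging=0.5):
    """Return P(damaging | features), assuming features are independent.

    features:    dict mapping feature name -> observed discrete value
    likelihoods: dict mapping feature name ->
                 {value: (P(value | damaging), P(value | benign))}
    """
    log_damaging = math.log(prior_damaging)
    log_benign = math.log(1.0 - prior_damaging)
    for name, value in features.items():
        p_damaging, p_benign = likelihoods[name][value]
        log_damaging += math.log(p_damaging)
        log_benign += math.log(p_benign)
    # Normalise in log space to avoid underflow with many features.
    shift = max(log_damaging, log_benign)
    weight_damaging = math.exp(log_damaging - shift)
    weight_benign = math.exp(log_benign - shift)
    return weight_damaging / (weight_damaging + weight_benign)


# Hypothetical likelihood tables (illustrative values, not trained parameters).
TOY_LIKELIHOODS = {
    "conservation": {"high": (0.8, 0.2), "low": (0.2, 0.8)},
    "buried": {True: (0.6, 0.3), False: (0.4, 0.7)},
}

conserved_buried = naive_bayes_posterior(
    {"conservation": "high", "buried": True}, TOY_LIKELIHOODS
)
variable_exposed = naive_bayes_posterior(
    {"conservation": "low", "buried": False}, TOY_LIKELIHOODS
)
```

In this toy setting a substitution at a conserved, buried position receives a much higher posterior than one at a variable, exposed position, which is the qualitative behaviour the feature description above suggests.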

Input, Output, and Interpretation

Users submit single-nucleotide variants or amino acid substitutions, specified with reference coordinates from assemblies such as GRCh37 or GRCh38 and with identifiers from resources like RefSeq and Ensembl. Outputs include a categorical classification (benign, possibly damaging, or probably damaging), a numeric score reflecting the posterior probability that the substitution is deleterious, and annotations describing alignment depth and structural mapping. These outputs are often interpreted in the context of clinical guidelines issued by organizations such as the American College of Medical Genetics and Genomics, and are integrated into variant review systems used by clinical groups at Cleveland Clinic and Stanford Health Care and by research consortia like ClinGen.
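The relationship between the numeric score and the three categories can be sketched as a simple thresholding function. The cutoff defaults below are assumptions: they approximate values often quoted in downstream annotation resources for one of the tool's models, are not stated in this article, and should be checked against the tool's own documentation before use. The function name `classify_score` is likewise hypothetical.

```python
def classify_score(score, possibly_cutoff=0.453, probably_cutoff=0.957):
    """Map a probability-like score in [0, 1] to one of three categories.

    The default cutoffs are illustrative assumptions, not authoritative values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score!r}")
    if possibly_cutoff > probably_cutoff:
        raise ValueError("possibly_cutoff must not exceed probably_cutoff")
    if score >= probably_cutoff:
        return "probably damaging"
    if score >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


# Example: three scores falling into the three categories.
labels = [classify_score(s) for s in (0.02, 0.60, 0.999)]
```

Keeping the cutoffs as parameters makes it explicit that the categorical label is a convenience derived from the score, so pipelines that need a different balance of false positives and false negatives can pass their own values.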

Performance and Validation

PolyPhen-2's performance was evaluated against benchmark datasets derived from curated resources, including UniProtKB/Swiss-Prot variant annotations, pathogenic variant lists from OMIM, and neutral polymorphisms identified by projects like the 1000 Genomes Project. The sensitivity and specificity figures reported in the original publications were compared with those of contemporaneous tools developed at institutions such as the Wellcome Sanger Institute and the Max Planck Institute for Molecular Genetics. Independent validations by groups at Yale University, University College London, and Karolinska Institute have highlighted context-dependent performance: accuracy varies by gene, protein domain, and dataset composition, in line with observations from meta-analyses published in Nature Genetics and Genome Research.
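Sensitivity and specificity on a labelled benchmark follow directly from the confusion counts. The sketch below shows the arithmetic on a small made-up set of labels; the data are invented for illustration and do not correspond to any published PolyPhen-2 benchmark, and the function name is hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """Compute (sensitivity, specificity) from paired boolean labels.

    truth[i]     is True if variant i is known to be pathogenic.
    predicted[i] is True if the tool called variant i damaging.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fn = tn = fp = 0
    for is_pathogenic, called_damaging in zip(truth, predicted):
        if is_pathogenic and called_damaging:
            tp += 1
        elif is_pathogenic:
            fn += 1
        elif called_damaging:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one pathogenic and one neutral variant")
    sensitivity = tp / (tp + fn)  # fraction of pathogenic variants detected
    specificity = tn / (tn + fp)  # fraction of neutral variants called benign
    return sensitivity, specificity


# Invented toy benchmark: 4 pathogenic and 4 neutral variants.
toy_truth = [True, True, True, False, False, False, False, True]
toy_calls = [True, True, False, False, True, False, False, True]
toy_sens, toy_spec = sensitivity_specificity(toy_truth, toy_calls)
```

Running the same calculation separately per gene or per dataset is one simple way to expose the context-dependent performance described above, since a single pooled figure can hide large differences between subsets.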

Applications and Use Cases

PolyPhen-2 is applied in human genetics studies at institutions such as the Broad Institute and Cold Spring Harbor Laboratory and at clinical centers like Massachusetts General Hospital. Use cases include prioritizing candidate variants in exome sequencing studies from consortia such as Deciphering Developmental Disorders (DDD) and in diagnostic pipelines for newborn screening programs influenced by policies at the Centers for Disease Control and Prevention. It supports evolutionary biology research carried out at the University of Oxford and comparative genomics projects linked to the Max Planck Society, and it is incorporated into annotation workflows used by commercial laboratories like 23andMe and clinical sequencing providers affiliated with Invitae.

Limitations and Criticisms

Critiques of PolyPhen-2 echo broader concerns raised by commentators in Nature Reviews Genetics and Genome Biology. Reliance on imperfect training sets, such as curated pathogenic variants from OMIM and neutral sets from population surveys like the 1000 Genomes Project, may bias predictions. Structural mapping is limited by gaps in Protein Data Bank coverage and by modeling inaccuracies noted by researchers at Scripps Research Institute and EMBL-EBI. Comparisons with ensemble methods developed at the Broad Institute and with integrative scores like CADD emphasize that single-tool predictions should not substitute for experimental assays performed in laboratories at Howard Hughes Medical Institute or the European Molecular Biology Laboratory, or for clinical validation in centers like Beth Israel Deaconess Medical Center. Finally, professional guidelines from the American College of Medical Genetics and Genomics caution against sole reliance on in silico predictions for clinical decision-making.
