SHAPEIT — LLMpedia

SHAPEIT
Name	SHAPEIT
Title	SHAPEIT
Developer	Olivier Delaneau, Jonathan Marchini, and collaborators (University of Oxford)
Released	2011
Latest release	SHAPEIT5 (2023)
Programming language	C++
Operating system	Linux; macOS; Windows (via WSL)
License	MIT (SHAPEIT4 and later); earlier versions free for academic use

Contents

Introduction
Methodology
Applications
Software Implementation
Performance and Accuracy
History and Development

SHAPEIT is a statistical phasing and haplotype estimation tool used in human genetics to reconstruct chromosomal phase from genotype data. It is widely used alongside resources and projects such as the 1000 Genomes Project, the UK Biobank, the International HapMap Project, GWAS consortia, and population cohorts curated by institutions like the Broad Institute, the Wellcome Trust Sanger Institute, the European Bioinformatics Institute, and the University of Oxford. The software works with formats and tools including PLINK, VCF, and BGEN, and its phased output is used with imputation services and reference resources such as the Michigan Imputation Server, the Haplotype Reference Consortium, and the European Genome-phenome Archive.

Introduction

SHAPEIT performs statistical phasing: given unphased genotypes from arrays or sequencing, it infers which alleles lie together on each of an individual's two chromosome copies (haplotypes). This enables downstream analyses in studies associated with the Wellcome Trust, the National Institutes of Health, the Wellcome Trust Sanger Institute, the European Molecular Biology Laboratory, the Max Planck Society, and disease consortia like The Cancer Genome Atlas and the Psychiatric Genomics Consortium. Phasing produced by SHAPEIT supports imputation against reference panels such as the 1000 Genomes Project, the Haplotype Reference Consortium, and population-specific panels curated by the Estonian Biobank and deCODE genetics. The tool appears in pipelines alongside GATK, BCFtools, and SAMtools and is applied in studies published in journals like Nature Genetics, Nature, Science, PLoS Genetics, and The American Journal of Human Genetics.
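
To illustrate why phasing needs a statistical model at all, the following minimal Python sketch enumerates every haplotype pair compatible with one unphased genotype. The function name `consistent_haplotype_pairs` and the 0/1/2 allele-count encoding are assumptions made for this illustration and are not part of SHAPEIT itself.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with an unphased genotype.

    genotype: sequence of alternate-allele counts per site (0, 1, or 2).
    Returns a set of frozensets, each holding the two haplotypes as tuples.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: count 0 -> allele 0, count 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Each heterozygous site can be oriented in two ways.
        for site, b in zip(het_sites, bits):
            h1[site] = b
            h2[site] = 1 - b
        # frozenset makes (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


if __name__ == "__main__":
    genotype = [1, 0, 1, 2, 1]  # three heterozygous sites
    for pair in sorted(tuple(sorted(p)) for p in consistent_haplotype_pairs(genotype)):
        print(pair)
```

A genotype with k heterozygous sites (k at least 1) admits 2^(k-1) such pairs, so exhaustive enumeration quickly becomes infeasible and a population-level model is needed to decide which pair is most plausible.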

Methodology

SHAPEIT implements a haplotype estimation algorithm based on the Li and Stephens model, a hidden Markov model (HMM) in which each haplotype is treated as an imperfect mosaic of other haplotypes. The framework builds on methods developed in computational genetics by groups at the University of Oxford, University College London, the University of Cambridge, and the Wellcome Trust Sanger Institute. SHAPEIT can use pre-phased reference panels such as the 1000 Genomes Project and the Haplotype Reference Consortium as the haplotypes being copied: switches between copied haplotypes model recombination, and mismatches model mutation. Recombination rates come from genetic maps such as those produced by the International HapMap Project and by researchers affiliated with the University of Southern California. Depending on the version, the core algorithm uses Markov chain Monte Carlo (MCMC) sampling or deterministic approximations, integrating ideas from statistical approaches used in projects led by investigators at the Broad Institute, the Max Planck Institute for Informatics, and the European Bioinformatics Institute. SHAPEIT also supports conditioning on identity-by-descent segments discovered with tools developed by teams at the Hinxton campus and with methods influenced by the International HapMap Consortium.
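
The copying model described above can be sketched with a scaled forward algorithm. This is a simplified haploid toy, not SHAPEIT's implementation: the function name `li_stephens_forward`, the constant per-interval switch probability `rho`, and the per-site miscopy probability `mu` are all assumptions made for illustration. SHAPEIT itself works on diploid genotypes and uses map-dependent recombination rates.

```python
import math


def li_stephens_forward(target, panel, rho, mu):
    """Log-likelihood of a target haplotype under a toy Li and Stephens model.

    target: list of 0/1 alleles.
    panel:  list of reference haplotypes (lists of 0/1, same length as target).
    rho:    probability of switching the copied haplotype between adjacent
            sites (assumed constant here; real tools derive it from a map).
    mu:     probability that the copied allele is miscopied, 0 < mu < 1.
    """
    n_ref = len(panel)

    def emit(k, site):
        # Match with probability 1 - mu, mismatch with probability mu.
        return 1.0 - mu if panel[k][site] == target[site] else mu

    # Uniform prior over which reference haplotype is copied at the first site.
    alpha = [emit(k, 0) / n_ref for k in range(n_ref)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]

    for site in range(1, len(target)):
        # alpha sums to 1 after scaling, so the total mass that switches
        # and lands on any given haplotype is rho / n_ref.
        alpha = [
            ((1.0 - rho) * alpha[k] + rho / n_ref) * emit(k, site)
            for k in range(n_ref)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]

    return loglik


if __name__ == "__main__":
    panel = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
    mosaic = [0, 0, 0, 0]    # explained by copying panel[0], then panel[1]
    print(li_stephens_forward(mosaic, panel, rho=0.05, mu=0.01))
    print(li_stephens_forward(panel[0], panel, rho=0.05, mu=0.01))
```

Rescaling `alpha` at every site keeps the recursion numerically stable on long chromosomes, and the structure of the transition (stay with probability 1 - rho, otherwise jump uniformly) is what lets each step run in time linear in the panel size.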

Applications

Researchers use SHAPEIT in genotype imputation pipelines for cohorts such as the UK Biobank, the Estonian Biobank, and the FinnGen project; for fine-mapping in studies by the Genetic Investigation of Anthropometric Traits consortium; and for population genetics analyses associated with the Human Genome Diversity Project, the Simons Genome Diversity Project, and regional projects coordinated by institutions like the Chinese Academy of Sciences and the Max Planck Institute for Evolutionary Anthropology. It is also used in disease association studies coordinated by the International Cancer Genome Consortium and the Psychiatric Genomics Consortium, and in efforts at the National Human Genome Research Institute to catalog variation. SHAPEIT output feeds into imputation services such as the Michigan Imputation Server and into analytical frameworks used by the Broad Institute in large-scale meta-analyses published in journals such as Nature Communications and Genome Research.

Software Implementation

SHAPEIT is implemented primarily in C++ and is distributed as command-line binaries and source code by its authors, with the original versions developed at the University of Oxford. It interoperates with file formats standardized by the 1000 Genomes Project, such as VCF, and with tools like PLINK, BCFtools, and VCFtools, and its output can be passed to services such as the Michigan Imputation Server. Successive releases (SHAPEIT2, SHAPEIT3, and SHAPEIT4) introduced multithreading, memory optimizations, and support for large datasets produced by projects such as the UK Biobank and by sequencing centers like the Wellcome Trust Sanger Institute and deCODE genetics. Packaging and workflow integration typically rely on workflow managers such as Nextflow, Snakemake, and Cromwell, the last of which is developed at the Broad Institute.

Performance and Accuracy

Benchmarks of SHAPEIT versions against alternative phasing methods developed by groups at the University of Oxford, the University of Michigan, and the University of Chicago show trade-offs between speed, memory, and phasing accuracy on datasets from the 1000 Genomes Project, the UK10K consortium, and the Haplotype Reference Consortium. Accuracy is usually reported as the switch error rate, the fraction of consecutive heterozygous sites whose relative phase is inferred incorrectly, together with downstream imputation accuracy. Comparisons presented at conferences such as the American Society of Human Genetics annual meeting and published in Nature Genetics and The American Journal of Human Genetics report competitive switch error rates and imputation accuracy for SHAPEIT. Performance improvements in SHAPEIT4 address scalability for cohorts like the UK Biobank on computational infrastructure such as that at the European Bioinformatics Institute and the Broad Institute.
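
As a rough illustration of the switch error rate used in such comparisons, the sketch below scores one individual's inferred haplotypes against a known truth. The function name `switch_error_rate` and the per-site `(allele_1, allele_2)` encoding are assumptions made for this example; published benchmarks may differ in details such as how missing or discordant genotypes are handled.

```python
def switch_error_rate(truth, inferred):
    """Fraction of adjacent heterozygous site pairs with wrong relative phase.

    truth, inferred: lists of (allele_1, allele_2) tuples, one per site,
    where position in the tuple identifies the haplotype.
    Sites that are homozygous in the truth, or whose inferred genotype
    disagrees with the truth, are skipped.
    """
    orientations = []
    for (t1, t2), (i1, i2) in zip(truth, inferred):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if {i1, i2} != {t1, t2}:
            continue  # genotype discordance is not a phasing error
        # True if the inferred alleles sit on the same haplotypes as the truth.
        orientations.append((i1, i2) == (t1, t2))

    if len(orientations) < 2:
        raise ValueError("need at least two comparable heterozygous sites")

    # A switch occurs whenever the orientation flips between neighbours.
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


if __name__ == "__main__":
    truth = [(0, 1), (1, 1), (0, 1), (1, 0), (0, 1)]
    inferred = [(0, 1), (1, 1), (0, 1), (0, 1), (1, 0)]
    print(switch_error_rate(truth, inferred))  # one switch over three intervals
```

Because only flips between neighbouring heterozygous sites are counted, a single switch that inverts the rest of the chromosome is penalized once, which is why the measure is preferred over a per-site mismatch count.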

History and Development

The first version of SHAPEIT was published in 2012 (online in late 2011) by Olivier Delaneau, Jonathan Marchini, and Jean-François Zagury, with development centered at the University of Oxford and set against the background of the International HapMap Project and the early 1000 Genomes Project. Subsequent iterations involved collaboration with scientists at the Broad Institute, the European Bioinformatics Institute, and contributors from institutions such as the University of Cambridge and University College London. The evolution through SHAPEIT2, SHAPEIT3, and SHAPEIT4, followed by SHAPEIT5, reflects methodological advances that paralleled progress in reference panel generation by the Haplotype Reference Consortium, large-scale genotyping efforts like the UK Biobank, and sequencing performed at the Wellcome Trust Sanger Institute. Ongoing maintenance and updates are carried out by the authors and their academic partners and are reported in venues including Genome Research and Nature Genetics.

Category:Bioinformatics software