| Principal component analysis | |
|---|---|
| Name | Principal component analysis |
| Invented by | Karl Pearson |
| Introduced | 1901 |
| Field | Statistics |
Principal component analysis (PCA) is a statistical technique for dimensionality reduction, data compression, and feature extraction that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. It was introduced by Karl Pearson in 1901 and later independently developed and named by Harold Hotelling; its mathematical lineage runs through the method of least squares associated with Adrien-Marie Legendre and Carl Friedrich Gauss. Today PCA underpins methods used throughout industrial and academic research, from web-scale machine learning to laboratory science.
PCA seeks orthogonal directions that capture maximal variance in multivariate data, summarizing high-dimensional measurements in a few informative components. Its theoretical roots lie in nineteenth-century matrix theory, including work by Arthur Cayley and James Joseph Sylvester, and its practical adoption accelerated with the advent of digital computing and numerical software. PCA is now a standard topic in statistics and machine learning curricula worldwide.
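In symbols, assuming a column-centered data matrix $X$ with $n$ rows and sample covariance $S$ (the notation here is conventional rather than drawn from a specific source), the first component solves

$$\mathbf{w}_1 = \arg\max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^{\top} S\, \mathbf{w}, \qquad S = \frac{1}{n-1} X^{\top} X,$$

and each subsequent component maximizes the same objective subject to orthogonality with all earlier ones; the maximizers are exactly the eigenvectors of $S$.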
Given a dataset represented as a matrix, PCA finds the eigenvectors of the covariance or scatter matrix, an instance of the eigenvalue problems studied in classical spectral theory. The principal components are the orthogonal eigenvectors associated with the eigenvalues taken in descending order. An alternative derivation uses the singular value decomposition (SVD) of the centered data matrix, which is usually preferred in practice because it avoids explicitly forming the covariance matrix and is numerically more stable.
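A minimal NumPy sketch of the two equivalent routes (the synthetic data and all variable names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated columns
Xc = X - X.mean(axis=0)                                  # center the data

# Route 1: eigen-decomposition of the sample covariance matrix.
S = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (Xc.shape[0] - 1)            # singular values -> variances

# Both routes give the same component variances, and the same
# directions up to sign.
assert np.allclose(eigvals, svd_vals)
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T))

scores = Xc @ eigvecs                          # principal component scores
```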
Practical computation of PCA uses SVD and eigen-decomposition routines from numerical libraries such as LAPACK and BLAS, exposed through environments including R, MATLAB, NumPy, and SciPy. For very large datasets, randomized algorithms based on sketching and streaming approximate the leading components in one or a few passes over the data, building on theoretical work in randomized numerical linear algebra. Incremental and online PCA variants update the decomposition as new observations arrive, and parallel and distributed implementations exploit GPUs and frameworks such as Apache Spark.
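As a sketch of the randomized approach, the following basic range-finder illustrates the idea (the function name and parameters are illustrative; production implementations, such as scikit-learn's `PCA(svd_solver="randomized")` and `IncrementalPCA`, add power iterations and error control on top of this):

```python
import numpy as np

def randomized_top_components(X, k, oversample=10, seed=0):
    """Approximate the top-k right singular vectors of a centered matrix X.

    A basic range-finder sketch: random projection, then QR, then an
    exact SVD of the resulting small matrix.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Project onto a random (k + oversample)-dimensional subspace.
    Omega = rng.normal(size=(d, k + oversample))
    Q, _ = np.linalg.qr(X @ Omega)        # orthonormal basis for the range
    # Solve the small problem exactly.
    B = Q.T @ X
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return s[:k], Vt[:k]                  # singular values and components
```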
Principal components are uncorrelated linear combinations of the original variables that maximize variance; equivalently, projecting onto the leading components minimizes mean-square reconstruction error among all linear projections of the same rank. This dual characterization links PCA to least-squares regression, a lineage reaching back to Francis Galton, and to errors-in-variables models. For Gaussian data the components are also statistically independent, a property exploited in many classical analyses. Interpreting loadings and scores generally requires domain expertise, and judging how many components are significant draws on hypothesis tests and heuristics from the multivariate statistics literature shaped by Ronald Fisher and his successors.
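The reconstruction-error characterization can be checked numerically; a small sketch on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def recon_error(basis):
    """Mean-square error of projecting Xc onto the row space of `basis`."""
    P = basis.T @ basis               # projector (rows of basis orthonormal)
    return np.mean((Xc - Xc @ P) ** 2)

k = 3
pca_err = recon_error(Vt[:k])

# Any other rank-k orthonormal basis reconstructs no better.
Q, _ = np.linalg.qr(rng.normal(size=(8, k)))
rand_err = recon_error(Q.T)
assert pca_err <= rand_err + 1e-12
```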
PCA is applied broadly: in image compression and face recognition, where the components are known as eigenfaces; in genetics, where projections of genotype matrices reveal population structure in studies building on the Human Genome Project; in finance, for extracting common risk factors from correlated asset returns; in neuroscience, for spike sorting and dimensionality reduction of electrophysiological recordings; and in climate science, where the components are called empirical orthogonal functions. Other domains include chemometrics, remote sensing, speech recognition, and recommender systems such as those popularized by the Netflix Prize.
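In applied settings a common first step is choosing the number of components from the explained-variance profile; a brief scikit-learn sketch (the data and the 95% threshold are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-in for application data, e.g. asset returns or genotype dosages.
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))

pca = PCA().fit(X)
# Cumulative share of variance captured by the leading components,
# a common criterion for deciding how many to keep.
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95)) + 1
print(f"{k} components explain 95% of the variance")
```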
Numerous extensions relate to PCA: kernel PCA, introduced by Bernhard Schölkopf and colleagues, applies the method in a nonlinear feature space induced by a kernel; sparse PCA constrains the loadings to have few nonzero entries for interpretability; probabilistic PCA, formulated by Michael Tipping and Christopher Bishop, recasts the method as a Gaussian latent-variable model; independent component analysis seeks statistically independent rather than merely uncorrelated components; and nonnegative matrix factorization, popularized by Daniel Lee and H. Sebastian Seung, imposes nonnegativity constraints. Related techniques include factor analysis, which originated in Charles Spearman's psychometric work, and manifold learning methods such as Isomap and t-SNE, the latter developed by Laurens van der Maaten and Geoffrey Hinton.
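For instance, kernel PCA can separate structure that is invisible to linear PCA; a short scikit-learn illustration (dataset and gamma chosen for demonstration only):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: plain PCA cannot unfold them, because no
# linear projection separates the rings, while an RBF kernel can.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_scores = PCA(n_components=2).fit_transform(X)
kernel_scores = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# In kernel_scores the two rings typically become separable along
# the first component; in linear_scores they remain nested.
```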
Category:Statistical methods