
Kullback–Leibler divergence

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Name: Kullback–Leibler divergence
Field: Information theory
Introduced: 1951
Introduced by: Solomon Kullback; Richard Leibler

The Kullback–Leibler divergence (also called relative entropy) is an information-theoretic measure that quantifies the difference between two probability distributions; it was introduced in 1951 by Solomon Kullback and Richard Leibler. It appears in contexts ranging from statistical inference in the work of Ronald Fisher and Andrey Kolmogorov to coding theory in the tradition of Claude Shannon and Norbert Wiener, and it underpins methods employed at institutions such as Bell Labs and IBM. The quantity connects to notions developed by John von Neumann and Alan Turing and finds use in practical systems designed by teams at Microsoft Research, Google Research, and Facebook AI Research.

Definition

The divergence is defined for two probability distributions P and Q over a common domain using an expectation taken with respect to P, a construction related to work by Abraham Wald and Jerzy Neyman; in continuous settings it is expressed with integrals in the measure-theoretic tradition of Émile Borel, Henri Lebesgue, and Andrey Kolmogorov. For discrete distributions the definition uses a sum over the sample space, while in the general case it uses densities, or more precisely Radon–Nikodym derivatives, of P with respect to Q. The asymmetry of the measure in its two arguments was already noted by Kullback and Leibler, who also considered the symmetrized sum of the two directed divergences.
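In standard notation, for a discrete sample space and for densities p and q in the continuous case, the divergence of Q from P is

D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad \text{and} \qquad D_{\mathrm{KL}}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx,

with the conventions 0 \log(0/q) = 0 and p \log(p/0) = +\infty, so the quantity is finite only when P is absolutely continuous with respect to Q.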

Properties

The divergence is nonnegative, a consequence of Jensen's inequality applied to the convex function −log (Gibbs' inequality), and it equals zero precisely when the two distributions coincide almost everywhere, which is the equality case of that inequality. It is not a metric: it is asymmetric in its arguments and fails the triangle inequality. The divergence satisfies chain rules and decomposition identities akin to those used by Andrey Kolmogorov and Claude Shannon in their respective theories, and it yields bounds such as Pinsker's inequality, due to Mark Pinsker, which controls the total variation distance between the distributions.
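A sketch of the standard nonnegativity argument uses the concavity of the logarithm (Jensen's inequality):

-D_{\mathrm{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x)\, \frac{Q(x)}{P(x)} \le \log 1 = 0,

with equality exactly when P = Q almost everywhere. Pinsker's inequality, with the divergence measured in nats and \delta denoting total variation distance, reads

\delta(P, Q) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \parallel Q)}.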

Examples

Classical examples include the divergence between two Bernoulli distributions, illustrated in textbooks influenced by William Feller and Harald Cramér, and the divergence between multivariate Gaussian densities with different mean vectors or covariance matrices, computations that appear in publications by Harold Jeffreys and Jerzy Neyman. In hypothesis testing following the Neyman–Pearson framework and Wald's sequential analysis, the divergence quantifies asymptotic error exponents, as formalized in Stein's lemma. Applications to model selection relate to the Akaike Information Criterion of Hirotugu Akaike, which is derived as an estimate of the expected divergence between a fitted model and the data-generating distribution.
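A minimal Python sketch of these two textbook computations, using the standard closed-form expressions for Bernoulli and univariate Gaussian distributions (function names here are illustrative, not taken from any particular library):

import math

def kl_bernoulli(p, q):
    # KL(Ber(p) || Ber(q)) in nats; assumes 0 < p, q < 1
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    # KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

print(kl_bernoulli(0.3, 0.5))           # two coin biases
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # shifted, wider Gaussian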

Applications

The divergence is used in machine learning algorithms developed at Stanford University, the Massachusetts Institute of Technology, and Carnegie Mellon University, including variational inference methods advanced by David MacKay and Michael Jordan and employed in software from OpenAI, DeepMind, and NVIDIA. It appears in natural language processing pipelines influenced by Noam Chomsky and Geoffrey Hinton and in image processing systems originating from work at Bell Labs and Xerox PARC. In bioinformatics it aids sequence analysis in projects at the European Bioinformatics Institute and the National Institutes of Health, and in econometrics it supports methods taught at the London School of Economics and the University of Chicago.
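To make the variational-inference connection concrete, the log evidence of a latent-variable model p(x, z) decomposes into the evidence lower bound (ELBO) and a divergence term, so maximizing the bound over an approximating distribution q(z) is equivalent to minimizing the divergence from the true posterior:

\log p(x) = \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right] + D_{\mathrm{KL}}\bigl(q(z) \parallel p(z \mid x)\bigr).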

Estimation and computation

Estimators for the divergence have been studied in asymptotic frameworks by Sergey Bernstein and Vladimir Vapnik and implemented using kernel methods influenced by Grace Wahba and Yoav Freund and ensemble methods arising from Leo Breiman. Monte Carlo and importance sampling techniques for numerical approximation are traced to the work of John von Neumann and Stanislaw Ulam and are incorporated into platforms by Amazon Web Services and Google Cloud. Efficient algorithms for high-dimensional settings draw on dimensionality-reduction schemes by Geoffrey Hinton and Yann LeCun and on randomized algorithms introduced by Donald Knuth and Peter Shor.
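A naive Monte Carlo sketch in Python, assuming samples can be drawn from P and both log densities can be evaluated; it averages the log density ratio over samples from P (helper names are illustrative, and bias and variance issues are ignored):

import numpy as np

def mc_kl(sample_p, log_p, log_q, n=100_000):
    # Naive Monte Carlo estimate of KL(P || Q) = E_P[log p(X) - log q(X)]
    x = sample_p(n)
    return float(np.mean(log_p(x) - log_q(x)))

# Example: KL(N(0, 1) || N(1, 2^2)); the closed form is log 2 + 2/8 - 1/2 ≈ 0.443
rng = np.random.default_rng(0)
estimate = mc_kl(
    lambda n: rng.normal(0.0, 1.0, size=n),
    lambda x: -0.5 * (x ** 2 + np.log(2 * np.pi)),
    lambda x: -0.5 * (((x - 1.0) / 2.0) ** 2 + np.log(2 * np.pi * 4.0)),
)
print(estimate)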

Generalizations

Generalizations include f-divergences, introduced in analyses by Imre Csiszár and others, and the Rényi divergence proposed by Alfréd Rényi, both of which relate to earlier entropy concepts from Ludwig Boltzmann and J. Willard Gibbs; the Jensen–Shannon divergence, whose name refers to Jensen's inequality and Shannon entropy, symmetrizes and smooths the original measure and is used in practice at institutions such as MIT and Caltech. Connections exist to Fisher information matrices from Ronald Fisher, to information geometry developed by Shun-ichi Amari, and to statistical distances such as the Hellinger distance studied by Ernst Hellinger and the distances arising in Lucien Le Cam's theory of statistical experiments.
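For reference, the standard discrete-case definitions of these related quantities are

D_f(P \parallel Q) = \sum_x Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right), \qquad f \text{ convex with } f(1) = 0,

D_\alpha(P \parallel Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^{\alpha}\, Q(x)^{1 - \alpha}, \qquad \alpha > 0,\ \alpha \neq 1,

\mathrm{JSD}(P, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M), \qquad M = \tfrac{1}{2}(P + Q).

Choosing f(t) = t \log t recovers the Kullback–Leibler divergence, and the Rényi divergence converges to it as \alpha \to 1.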

Category:Information theory