LLMpedia: The first transparent, open encyclopedia generated by LLMs

Relative entropy

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Expansion Funnel: Extracted (Raw) 73 → After dedup 0 → After NER 0 → Enqueued 0
Relative entropy
Name: Relative entropy
Caption: Kullback–Leibler divergence visualization
Field: Information theory, Statistics, Probability theory
Introduced: 1951
Inventor: Solomon Kullback; Richard Leibler
Notation: D_{KL}(P\|Q)

Relative entropy

Relative entropy, also known as the Kullback–Leibler divergence and commonly denoted D_{KL}(P\|Q), is a measure of dissimilarity between two probability distributions introduced by Solomon Kullback and Richard Leibler in 1951. It appears across work by Claude Shannon, Norbert Wiener, Harold Jeffreys, Andrey Kolmogorov, and Alfréd Rényi and connects to foundational results by Alan Turing and John von Neumann. Across fields influenced by Paul Erdős, Andrey Markov, and Émile Borel, relative entropy underpins results in the traditions of David Blackwell, Thomas Cover, Joy Thomas, Imre Csiszár, and C. R. Rao.

Definition and basic properties

Relative entropy is defined for two probability measures P and Q on a common measurable space: when P is absolutely continuous with respect to Q it is given by D_{KL}(P\|Q)=\int \log(dP/dQ)\,dP, and by convention it equals +\infty otherwise. Early expositions by Claude Shannon and formalizations by Andrey Kolmogorov and Harold Jeffreys motivated this Radon–Nikodym formulation. Fundamental properties include nonnegativity (Gibbs' inequality), joint convexity in the pair (P, Q) (linked to work by Hermann Weyl and John von Neumann), and additivity under product measures (seen in constructions by Ralph Fox and developed in axiomatic systems related to E. T. Jaynes). Relative entropy is asymmetric, vanishes if and only if P = Q almost everywhere, and is infinite whenever P is not absolutely continuous with respect to Q, a phenomenon studied by Paul Lévy and J. Willard Gibbs.
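As an illustration of the discrete case of this definition, the following sketch computes D_{KL}(P\|Q) = \sum_x P(x)\log(P(x)/Q(x)) for small hand-picked distributions and exhibits nonnegativity, asymmetry, and the infinite value that arises when P places mass outside the support of Q. The distributions, the function name kl_divergence, and the use of natural logarithms (nats) are illustrative choices, not taken from the article.

```python
# Minimal sketch of the discrete-case definition of relative entropy, using toy distributions.
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in nats for discrete distributions given as equal-length lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 * log(0/q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf       # P is not absolutely continuous with respect to Q
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q))   # nonnegative; zero only when p == q
print(kl_divergence(q, p))   # generally a different value: D_KL is asymmetric
print(kl_divergence([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))  # inf: P has mass where Q has none
```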

Relation to other divergence measures

Relative entropy relates to many divergences studied by Alfréd Rényi, I. J. Good, and Hugo Touchette. The Rényi divergence D_α(P\|Q) generalizes D_{KL}, recovering it in the limit α→1 (work by Alfréd Rényi), while the Hellinger distance linked to Émile Borel and the total variation distance studied by Andrey Kolmogorov and Pafnuty Chebyshev admit bounds in terms of D_{KL}. Csiszár f-divergences (introduced by Imre Csiszár) encompass relative entropy as a special case; relationships with the χ^2-divergence analyzed by C. R. Rao and Ronald Fisher connect to Pinsker-type inequalities and to tail bounds in the tradition of Sergei Bernstein and Andrey Kolmogorov. The symmetrized Kullback–Leibler divergence connects to the Jensen–Shannon divergence studied by Richard Hall and popularized in machine learning work following Geoffrey Hinton.
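A hedged numerical sketch of these relationships: it computes total variation, Hellinger, χ^2, and Jensen–Shannon divergences for two toy distributions, shows that the Rényi divergence approaches D_{KL} as α→1, and checks Pinsker's inequality in the form TV ≤ \sqrt{D_{KL}/2} (natural-logarithm convention). The distribution values and variable names are illustrative assumptions, not from the article.

```python
# Relative entropy alongside a few related divergences, for two toy distributions.
import numpy as np

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl        = np.sum(p * np.log(p / q))                            # D_KL(P||Q), nats
tv        = 0.5 * np.sum(np.abs(p - q))                          # total variation distance
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2))  # Hellinger distance
chi2      = np.sum((p - q)**2 / q)                               # chi-squared divergence
m  = 0.5 * (p + q)                                               # mixture for Jensen-Shannon
js = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

def renyi(alpha):
    """Renyi divergence D_alpha(P||Q); approaches D_KL as alpha -> 1."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

print(kl, tv, hellinger, chi2, js)
print(renyi(0.999), "~", kl)      # Renyi divergence near alpha = 1 recovers D_KL
print(tv <= np.sqrt(kl / 2))      # Pinsker's inequality holds
```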

Interpretation and operational meaning

Operationally, relative entropy appears in coding theorems of Claude Shannon, hypothesis testing theorems of Jerzy Neyman and Egon Pearson, and large deviations principles of Srinivasa Varadhan and Harald Cramér. In source coding, D_{KL}(P\|Q) quantifies the expected excess code length when encoding samples from P with a code optimized for Q, a perspective elaborated by Thomas Cover and Joy Thomas. In binary hypothesis testing, Stein's lemma (with connections to Andrey Kolmogorov and Wassily Hoeffding) identifies D_{KL} as the exponential rate of decay of the type II error probability under a fixed type I error constraint. In statistical mechanics, links to free energy and entropy production trace through work by Ludwig Boltzmann, Josiah Willard Gibbs, and Ilya Prigogine; in Bayesian updating, D_{KL} underlies variational approximations used in practice by Radford Neal and David MacKay.
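The source-coding reading can be made concrete with a small sketch, assuming ideal (non-integer) code lengths and a toy three-symbol alphabet of my own choosing: encoding samples from P with a code optimized for Q costs the cross-entropy H(P, Q) bits per symbol rather than the entropy H(P), and the excess equals D_{KL}(P\|Q) measured in bits.

```python
# Expected excess code length: cross-entropy minus entropy equals D_KL (base-2 logs, bits).
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true source distribution P
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # model distribution Q the code was designed for

entropy_p     = -sum(px * math.log2(px) for px in p.values())    # H(P), optimal rate for P
cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)          # H(P, Q), rate of the mismatched code
kl_bits       = sum(p[x] * math.log2(p[x] / q[x]) for x in p)    # D_KL(P||Q) in bits

print(entropy_p, cross_entropy, cross_entropy - entropy_p, kl_bits)
# cross_entropy - entropy_p == kl_bits: the expected excess code length per symbol
```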

Mathematical properties and inequalities

The mathematical structure of relative entropy has been explored by Imre Csiszár, Thomas Cover, and S. Kullback. Key inequalities include Gibbs' inequality, Pinsker's inequality (relating D_{KL} to total variation, historically refined by Olivier Rioul and Yorick Zelnik), the data-processing inequality (monotonicity under measurable mappings, with lineage to John von Neumann and Andrey Kolmogorov), and convexity and strict-convexity results credited in rigorous form to Hermann Weyl and Léon Brillouin. Chain rules decompose D_{KL} for joint distributions and are employed in martingale entropy techniques by Joseph Doob and in concentration inequalities developed by Michel Talagrand. Continuity and lower semicontinuity properties follow from functional-analysis traditions of Stefan Banach and John von Neumann; duality representations relate to convex conjugates studied by Maurice Fréchet.
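The data-processing inequality mentioned here admits a simple numerical sanity check (not a proof): pushing both P and Q through the same stochastic map, represented below by a random row-stochastic matrix K, can only decrease D_{KL}. The dimensions, random seed, and kernel are arbitrary illustrative choices.

```python
# Numerical check that processing through a common Markov kernel never increases D_KL.
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """D_KL(P||Q) in nats for probability vectors p and q."""
    return float(np.sum(p * np.log(p / q)))

p = rng.dirichlet(np.ones(5))           # input distribution P on 5 outcomes
q = rng.dirichlet(np.ones(5))           # input distribution Q on 5 outcomes
K = rng.dirichlet(np.ones(4), size=5)   # 5x4 row-stochastic kernel (the "processing")

print(kl(p, q))          # divergence before processing
print(kl(p @ K, q @ K))  # divergence after processing: never larger
assert kl(p @ K, q @ K) <= kl(p, q) + 1e-12
```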

Applications in information theory and statistics

Relative entropy is central to channel capacity results of Claude Shannon, to universal coding schemes by Jorma Rissanen, and to model selection criteria influenced by Hirotugu Akaike and George Box. It appears in estimation theory via Kullback–Leibler information for maximum likelihood analyses developed by Ronald Fisher and C. R. Rao, and in empirical process theory connected to Vladimir Vapnik and Alexey Chervonenkis. Machine learning applications span variational inference popularized by Michael Jordan and Radford Neal, generative modeling work following Ian Goodfellow, and feature selection methods inspired by Trevor Hastie and Robert Tibshirani. In neuroscience and cognitive science, D_{KL} measures surprise in predictive coding frameworks influenced by Horace Barlow and Karl Friston; in statistical physics it quantifies irreversibility in fluctuation theorems by Gavin Crooks and Christopher Jarzynski.
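As a concrete instance of the divergences that arise in variational inference, the following sketch evaluates the closed-form D_{KL} between two univariate Gaussians, the kind of term that appears as a regularizer in variational approximations, and checks it against a naive Monte Carlo average of the log density ratio. The parameter values are assumptions for illustration only.

```python
# Closed-form KL divergence between univariate Gaussians, verified by Monte Carlo.
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in closed form (nats)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

mu1, sigma1 = 0.0, 1.0     # e.g. an approximating distribution in a variational setup
mu2, sigma2 = 1.0, 2.0     # e.g. a prior

rng = np.random.default_rng(1)
x = rng.normal(mu1, sigma1, size=200_000)             # samples from the first Gaussian
log_ratio = ((x - mu2)**2 / (2 * sigma2**2)           # log density ratio log p1(x)/p2(x)
             - (x - mu1)**2 / (2 * sigma1**2)
             + np.log(sigma2 / sigma1))
print(kl_gaussians(mu1, sigma1, mu2, sigma2))          # exact value
print(log_ratio.mean())                                # Monte Carlo estimate agrees closely
```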

Computational methods and estimators

Estimating relative entropy from samples engages density estimation approaches by Andrey Kolmogorov, Vladimir Vapnik, and Bradley Efron. Plug-in estimators based on kernel density estimators trace to work by Murray Rosenblatt and Emanuel Parzen, while k-nearest-neighbor estimators build on ideas by Leo Breiman and Larry Wasserman. Bias correction and minimax rates have been analyzed by Yuri Ingster and by statistical learning theorists of the Yann LeCun era; variational methods solving convex dual problems are used in practice by Michael Jordan and David Blei. Recent computational advances leverage stochastic optimization frameworks developed by Léon Bottou and John Duchi, and scalable approximations in deep learning pipelines from Geoffrey Hinton and Yoshua Bengio.
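A minimal sketch in the spirit of the nearest-neighbor estimators mentioned above, assuming one-dimensional samples and a simplified 1-nearest-neighbor construction (the exact estimators and their bias corrections vary across the literature): it estimates D_{KL}(P\|Q) from samples alone and compares the result with the known closed form for two unit-variance Gaussians.

```python
# Sample-based KL estimation via nearest-neighbor distances (simplified 1-NN, 1-D case).
import numpy as np

def knn_kl_estimate(x, y):
    """Rough 1-NN estimate of D_KL(P||Q) in nats from 1-D samples x ~ P and y ~ Q."""
    n, m = len(x), len(y)
    dx = np.abs(x[:, None] - x[None, :])               # pairwise distances within the P sample
    np.fill_diagonal(dx, np.inf)                       # a point is not its own neighbor
    rho = dx.min(axis=1)                               # NN distance inside the P sample
    nu = np.abs(x[:, None] - y[None, :]).min(axis=1)   # NN distance into the Q sample
    return np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=2000)   # samples from P = N(0, 1)
y = rng.normal(1.0, 1.0, size=2000)   # samples from Q = N(1, 1)

true_kl = 0.5  # closed form for N(0,1) vs N(1,1): (mu1 - mu2)^2 / (2 sigma^2)
print(knn_kl_estimate(x, y), "vs true", true_kl)
```

Such estimators are consistent but biased at finite sample sizes, which is where the bias-correction and minimax analyses cited above come in.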

Category:Information theory