| Fleiss' kappa | |
|---|---|
| Name | Fleiss' kappa |
| Developer | Joseph L. Fleiss |
| Introduced | 1971 |
| Field | Statistics; Psychometrics |
| Related | Cohen's kappa; Krippendorff's alpha; Intraclass correlation coefficient |
Fleiss' kappa is a statistical measure for assessing the reliability of agreement between multiple raters assigning categorical ratings to a fixed number of items. Developed by Joseph L. Fleiss, building on earlier work by Jacob Cohen and others, it generalizes pairwise agreement metrics to the multi-rater setting and is widely used in empirical studies across disciplines including medicine, psychology, sociology, and computer science.
Fleiss' kappa quantifies the degree to which observed agreement among a set of raters on a set of items exceeds agreement expected by chance, relative to the maximum possible agreement beyond chance. Its origins trace to reliability problems studied in the mid-20th century by figures such as Jacob Cohen and Leo Goodman, and it has been applied in work involving institutions such as the World Health Organization, the National Institutes of Health, and research projects at universities including Harvard University, Stanford University, and the University of Cambridge.
Let N denote the number of items, n the number of raters per item (assumed constant), and k the number of categories. For item i and category j, let n_{ij} be the number of raters who assigned item i to category j. The extent of agreement on item i is

P_i = [1/(n(n − 1))] Σ_j n_{ij}(n_{ij} − 1),

and the marginal proportion of all assignments falling in category j, the same quantity used in classical contingency-table analysis, is

p_j = (1/(N n)) Σ_i n_{ij}.

The mean observed agreement and the agreement expected by chance are

P̄ = (1/N) Σ_i P_i  and  P̄_e = Σ_j p_j²,

and Fleiss' kappa is defined as

κ = (P̄ − P̄_e) / (1 − P̄_e).
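The computation can be sketched directly from these formulas. The short NumPy function below is a minimal illustration, assuming an N × k matrix of rating counts whose rows each sum to the common number of raters n; the function name `fleiss_kappa` and the toy `ratings` matrix are hypothetical examples, not part of the original presentation.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (N items x k categories) matrix of rating counts.

    Every row must sum to the same number of raters n.
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()
    if not np.allclose(counts.sum(axis=1), n):
        raise ValueError("every item must be rated by the same number of raters")

    # Per-item agreement: P_i = [sum_j n_ij (n_ij - 1)] / [n (n - 1)]
    P_i = np.sum(counts * (counts - 1), axis=1) / (n * (n - 1))
    P_bar = P_i.mean()                      # mean observed agreement
    p_j = counts.sum(axis=0) / (N * n)      # marginal category proportions
    P_e = np.sum(p_j ** 2)                  # chance-expected agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 5 items, 14 raters, 5 categories (each row sums to 14).
ratings = np.array([
    [0, 0, 0, 0, 14],
    [0, 2, 6, 4, 2],
    [0, 0, 3, 5, 6],
    [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1],
])
print(round(fleiss_kappa(ratings), 3))
```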
Fleiss' kappa ranges theoretically from −1 to 1. Values near 1 indicate near-perfect agreement beyond chance, values near 0 indicate agreement comparable to chance expectations, and negative values indicate systematic disagreement. Interpretive schemes often cite benchmark thresholds such as those proposed by Landis and Koch (1977) and conventions associated with Jacob Cohen, along with reporting practices of the American Psychological Association and journals such as Nature and The Lancet. Fleiss' kappa assumes a fixed number of raters per item and nominal categories, and like Cohen's kappa it can exhibit paradoxical behavior when marginal distributions are highly skewed, a phenomenon noted in critiques by statisticians at the University of Chicago and Columbia University.
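As a small illustration of one commonly cited interpretive scheme, the helper below maps a κ value to the qualitative benchmarks of Landis and Koch (1977); the cutoffs are conventions from that tradition rather than properties of the statistic, and the function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Qualitative label for kappa using the Landis-Koch (1977) benchmarks."""
    if kappa < 0:
        return "poor"
    bins = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
            (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bins:
        if kappa <= upper:
            return label
    return "almost perfect"

print(interpret_kappa(0.55))  # -> "moderate"
```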
Point estimation uses the sample proportions p_j and the observed P̄. Asymptotic variance formulas for κ derive from large-sample theory developed by figures including Jerzy Neyman and Egon Pearson, with practical variance estimators available in standard texts on statistical inference. The standard error is often estimated via plug-in estimators or resampling methods such as the bootstrap introduced by Bradley Efron, which is implemented in statistical software maintained by organizations including the R Project and StataCorp. Confidence intervals for κ may be constructed using the normal approximation, the percentile bootstrap, or bias-corrected methods employed in computational studies at the Massachusetts Institute of Technology and Carnegie Mellon University.
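A percentile-bootstrap interval can be sketched by resampling items (rows of the count matrix) with replacement and recomputing κ on each resample. The snippet below is a minimal, self-contained illustration; the helper names, the number of resamples, and the toy count matrix are assumptions made for the example.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (N items x k categories) count matrix."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    n = counts[0].sum()
    P_i = np.sum(counts * (counts - 1), axis=1) / (n * (n - 1))
    p_j = counts.sum(axis=0) / (N * n)
    P_e = np.sum(p_j ** 2)
    return (P_i.mean() - P_e) / (1 - P_e)

def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    stats = [fleiss_kappa(counts[rng.integers(0, N, size=N)])
             for _ in range(n_boot)]
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

# Toy data: 6 items, 4 raters, 3 categories (each row sums to 4).
ratings = np.array([
    [3, 1, 0],
    [2, 2, 0],
    [0, 3, 1],
    [1, 1, 2],
    [0, 1, 3],
    [2, 1, 1],
])
low, high = bootstrap_ci(ratings)
print(f"point estimate: {fleiss_kappa(ratings):.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```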
Fleiss' kappa has been applied in diverse empirical domains: assessing inter-rater reliability among clinicians at Johns Hopkins Hospital and the Mayo Clinic; coding qualitative data in studies at the University of California, Berkeley and University College London; labeling image datasets in computer vision research at Google Research, Facebook AI Research, and Microsoft Research; and content analysis in media studies referencing outlets such as The New York Times and the BBC. Example applications include rating radiographic images for the presence of pathology, classifying legal case types in projects at Harvard Law School, and annotating speech corpora in collaborations with the ITU and European Commission research programs. Reported κ values in the applied literature typically range from poor to excellent depending on task difficulty and coder training, with many medical diagnostic studies reporting κ in the 0.4–0.8 range.
Limitations include sensitivity to prevalence and bias effects, dependence on an equal number of raters per item, and counterintuitive interpretation under skewed marginals, issues discussed by researchers at Princeton University, Yale University, and the University of Toronto. Alternatives and extensions include Cohen's kappa for two raters, Scott's pi in content analysis, Krippendorff's alpha for variable rater counts and missing data, and the intraclass correlation coefficient for ordinal or continuous ratings. Advanced modeling approaches, such as Bayesian latent class models used by teams at the University of Oxford and ETH Zurich and item response theory frameworks developed in psychometrics at the University of Chicago, offer ways to account for rater bias, item difficulty, and varying rater reliability.
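The prevalence sensitivity can be illustrated with two hypothetical rating tables that have the same mean observed agreement but different marginal balance. The sketch below is an assumption-laden example, not data from any study; it assumes statsmodels is installed and uses its fleiss_kappa function from statsmodels.stats.inter_rater, and it shows κ dropping sharply when one category dominates.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa  # assumes statsmodels is available

# Skewed marginals: 9 of 10 items are unanimously category A, one item is split 3/2.
skewed = np.array([[5, 0]] * 9 + [[3, 2]])

# Balanced marginals: same per-item agreement pattern, but the categories alternate.
balanced = np.array([[5, 0], [0, 5], [5, 0], [0, 5], [5, 0],
                     [0, 5], [5, 0], [0, 5], [5, 0], [3, 2]])

for name, table in [("skewed", skewed), ("balanced", balanced)]:
    counts = table.astype(float)
    n = counts[0].sum()
    p_bar = np.mean(np.sum(counts * (counts - 1), axis=1) / (n * (n - 1)))
    # Both tables have observed agreement of about 0.94, but very different kappa values.
    print(f"{name}: observed agreement {p_bar:.2f}, kappa {fleiss_kappa(table):.2f}")
```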
Category:Statistical reliability measures