| Cohen's kappa | |
|---|---|
| Name | Cohen's kappa |
| Introduced | 1960 |
| Inventor | Jacob Cohen |
| Field | Statistics |
| Related | Pearson correlation coefficient, Spearman's rank correlation coefficient, Fleiss' kappa |
Cohen's kappa is a statistic that measures inter-rater agreement for categorical items beyond what would be expected by chance. Introduced in 1960 by Jacob Cohen, it quantifies concordance between two raters classifying the same set of subjects, adjusting observed agreement for the agreement expected if the raters assigned categories independently. The measure is widely applied in medicine, psychology, sociology, epidemiology, and linguistics, and is commonly reported alongside related measures such as the Pearson correlation coefficient and the intraclass correlation coefficient.
Cohen's kappa expresses agreement as the proportion of observed agreement remaining after removing agreement expected by chance. It is interpreted on a scale where 1 indicates perfect agreement, 0 indicates agreement equivalent to chance, and negative values indicate less-than-chance agreement. In applied reports, benchmark thresholds, most commonly those proposed by Landis and Koch (1977), are cited to describe "slight", "fair", "moderate", "substantial", or "almost perfect" agreement. Such benchmarks are conventions rather than universal standards, and practitioners must consider the influence of category prevalence and rater bias when interpreting kappa values.
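The Landis and Koch benchmark labels can be expressed as a small lookup. A minimal sketch (the function name is illustrative, not from any standard library):

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis and Koch (1977) benchmark labels."""
    if kappa < 0:
        return "poor"  # less-than-chance agreement
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"

print(interpret_kappa(0.65))  # → substantial
```

Note that these labels carry no statistical guarantee; the same κ can reflect very different agreement patterns depending on the marginal distributions.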
For two raters assigning N subjects to k mutually exclusive categories, denote the observed proportion of agreements by P_o and the chance-expected proportion by P_e, computed from the marginal proportions. Cohen's kappa is defined as κ = (P_o − P_e) / (1 − P_e). The observed agreement P_o is the trace of the contingency table of counts divided by N; the chance agreement is P_e = Σ_i p_{i1} p_{i2}, where p_{i1} and p_{i2} are the marginal proportions for category i under rater 1 and rater 2 respectively. This formulation parallels chance-corrected constructions such as the phi coefficient, and contrasts with weighted forms that incorporate linear or squared-distance weights, as in quadratic weighted kappa. The maximum and minimum attainable κ depend on the marginal distributions: when the two raters' marginals differ, κ cannot reach 1 even under the highest agreement the marginals allow.
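The definition above translates directly into code operating on a contingency table. A minimal sketch in Python (the function name is illustrative):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a k x k contingency table of counts.

    Rows index rater 1's categories, columns rater 2's.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n           # observed agreement: diagonal / N
    p1 = table.sum(axis=1) / n          # rater 1 marginal proportions
    p2 = table.sum(axis=0) / n          # rater 2 marginal proportions
    p_e = np.dot(p1, p2)                # chance agreement under independence
    return (p_o - p_e) / (1.0 - p_e)

# Example: 50 subjects, 2 categories; P_o = 0.7, P_e = 0.5
table = [[20, 5],
         [10, 15]]
print(round(cohens_kappa(table), 3))    # → 0.4
```

Here κ = (0.7 − 0.5) / (1 − 0.5) = 0.4: raw agreement of 70% shrinks considerably once chance agreement is removed.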
Point estimation of κ uses the observed contingency table; standard errors and confidence intervals can be obtained via asymptotic normal approximations or resampling methods. Asymptotic variance formulas derive from delta-method arguments for both the unweighted and weighted statistics. Bootstrap confidence intervals (percentile or bias-corrected) resample subjects with replacement and are implemented in standard statistical software, including R (e.g., the irr and psych packages), SAS (PROC FREQ with the AGREE option), and Stata (the kap command). Hypothesis tests compare the estimated parameter to a hypothesized value (e.g., κ = 0), and permutation tests have been advocated to control type I error under small sample sizes or sparse tables.
Several extensions address limitations of the original two-rater, unweighted formulation. Weighted kappa, introduced by Cohen in 1968, assigns partial credit to disagreements between ordered categories, with linear and quadratic weights being the most common choices. Multi-rater generalizations include Fleiss' kappa, widely used in multicenter trials. Prevalence- and bias-adjusted variants (e.g., PABAK) correct for skewed marginals, and model-based treatments of agreement arise in latent class analysis and hierarchical models. The machine-learning literature uses agreement measures, notably quadratic weighted kappa, to evaluate annotator reliability in natural-language and image datasets.
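Weighted kappa is commonly written as κ_w = 1 − Σ w_ij p_obs,ij / Σ w_ij p_exp,ij, with disagreement weights w_ij that grow with the distance between ordered categories. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def weighted_kappa(table, weights="quadratic"):
    """Weighted kappa for k ordered categories from a k x k table of counts.

    weights: "linear" uses |i - j| / (k - 1); "quadratic" uses
    ((i - j) / (k - 1)) ** 2. Unweighted kappa corresponds to 0/1 weights.
    """
    table = np.asarray(table, dtype=float)
    k = table.shape[0]
    i, j = np.indices((k, k))
    if weights == "linear":
        w = np.abs(i - j) / (k - 1)
    elif weights == "quadratic":
        w = ((i - j) / (k - 1)) ** 2
    else:
        raise ValueError("weights must be 'linear' or 'quadratic'")
    n = table.sum()
    p_obs = table / n                                            # observed cell proportions
    p_exp = np.outer(table.sum(axis=1), table.sum(axis=0)) / n**2  # expected under independence
    return 1.0 - (w * p_obs).sum() / (w * p_exp).sum()
```

For k = 2 both weighting schemes reduce to the unweighted statistic; for k > 2, quadratic weights penalize distant ordinal disagreements more heavily than adjacent ones.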
Applications span diagnostic classification in clinical research, content analysis in the social sciences, and annotation tasks in natural-language and computer-vision pipelines. Limitations include sensitivity to category prevalence and marginal imbalance (the "kappa paradox"), dependence on the number of categories, and potential misinterpretation when raters apply different implicit criteria, a recurring concern in multicenter studies and behavioral coding. Methodologists therefore recommend reporting raw agreement, prevalence indices, and bias indices alongside kappa, or using latent variable models as complements.
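The "kappa paradox" is easy to demonstrate numerically: two tables with identical raw agreement can yield very different κ when prevalence is skewed. A small sketch (reusing the standard kappa formula):

```python
import numpy as np

def cohens_kappa(table):
    """Kappa from a k x k contingency table (rows: rater 1, cols: rater 2)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n
    p_e = np.dot(table.sum(axis=1), table.sum(axis=0)) / n**2
    return (p_o - p_e) / (1.0 - p_e)

# Balanced marginals: 90% raw agreement
balanced = [[45, 5],
            [5, 45]]
# Skewed prevalence: the same 90% raw agreement
skewed = [[85, 5],
          [5, 5]]
print(cohens_kappa(balanced))  # ≈ 0.80
print(cohens_kappa(skewed))    # ≈ 0.44
```

Both tables show 90% observed agreement, but the skewed table's chance agreement P_e rises to 0.82, roughly halving κ; this is why reporting raw agreement and prevalence indices alongside κ is recommended.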
Category:Statistical measures