Item response theory

Name	Item response theory
Field	Psychometrics
Developer	Georg Rasch; Frederic M. Lord; Allan Birnbaum
Introduced	1950s–1960s

Contents

Overview
Mathematical models
Estimation and scaling methods
Applications
Assumptions and model fit
Software and implementation
History and development

Item response theory is a family of probabilistic models used to relate individual responses on test items to latent traits such as ability, aptitude, or attitude. It provides item-level parameterization that enables comparisons across instruments, equating of scales, and precision estimation for individuals, widely informing practice in standardized testing, psychological assessment, and health measurement. IRT interfaces with classical test theory traditions and modern computational methods developed in psychometrics, statistics, and measurement science.

Overview

IRT models describe the probability that a respondent with a given latent trait level will endorse an item or answer it correctly, as a function of item parameters and person parameters. Pioneering figures associated with the development and dissemination of IRT concepts include Georg Rasch, Frederic M. Lord, and Allan Birnbaum, along with institutions such as Educational Testing Service, the National Council on Measurement in Education, and United States Department of Education research programs. Canonical use cases span admissions and placement testing by the College Board, licensure by the National Board of Medical Examiners, large-scale surveys by the Organisation for Economic Co-operation and Development, and clinical outcomes measurement in initiatives like PROMIS.

Mathematical models

Core parametrizations include logistic and normal-ogive formulations. The simplest is the one-parameter logistic model (1PL), associated historically with Georg Rasch, in which items differ only in difficulty. The two-parameter logistic model (2PL) introduces item discrimination and links to the work of Frederic M. Lord and Allan Birnbaum. The three-parameter logistic model (3PL) adds a pseudo-guessing parameter, a lower asymptote for the response probability, and is prominent in high-stakes programs administered by Educational Testing Service and the College Board. Polytomous extensions include the graded response model by Fumiko Samejima, the rating scale model by David Andrich, and the partial credit model by Geoff Masters, the latter two extending the Rasch model to ordered response categories. Multidimensional IRT models extend unidimensional formulations and connect to factor analytic traditions advanced by scholars affiliated with University of Chicago, University of Illinois Urbana-Champaign, and University of Michigan research groups.
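
As an illustration of the logistic family, the sketch below evaluates the 3PL item response function in its conventional form, P(theta) = c + (1 - c) / (1 + exp(-a (theta - b))). The symbols a (discrimination), b (difficulty), c (pseudo-guessing), and theta (latent trait) follow common notation rather than anything fixed by this article, and the function name and example values are hypothetical. Setting c = 0 gives the 2PL, and additionally fixing a to a common value gives a 1PL-type curve.

```python
import math

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct or endorsed response under the 3PL model.

    theta: person latent trait level
    a: item discrimination (slope)
    b: item difficulty (location)
    c: pseudo-guessing lower asymptote; c=0 reduces the model to the 2PL
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Illustrative parameter values only, not taken from any real test.
p_1pl = irf_3pl(0.0, b=0.0)                   # trait level equal to difficulty
p_2pl = irf_3pl(1.0, a=2.0, b=0.0)            # steeper item, respondent above b
p_3pl = irf_3pl(-10.0, a=1.0, b=0.0, c=0.2)   # very low trait level, near c
```

When the trait level equals the item difficulty and there is no guessing, the probability is exactly one half; with a nonzero c, the curve never drops below c however low the trait level.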

Estimation and scaling methods

Parameter estimation relies on maximum likelihood, marginal maximum likelihood, and Bayesian techniques. Marginal maximum likelihood with the EM algorithm was popularized in software developed at Educational Testing Service and by researchers at University of California, Berkeley and Stanford University. Markov chain Monte Carlo and Hamiltonian Monte Carlo implementations draw on work from Columbia University, Princeton University, and the University of Cambridge statistics communities. Scale linking and equating procedures use common-item equating and concurrent calibration methodologies employed by organizations such as ACT, Inc. and Institute of Education Sciences programs.
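
A minimal sketch of the maximum likelihood idea, under simplifying assumptions: item parameters are treated as already known, the model is the 2PL, and only one respondent's trait level is estimated by Newton-Raphson iteration. This is not the marginal maximum likelihood or Bayesian machinery used in operational calibration; the function names, starting value, and example items are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_and_information(theta, responses, a_params, b_params):
    """First derivative of the log-likelihood and the test information at theta."""
    grad = 0.0
    info = 0.0
    for u, a, b in zip(responses, a_params, b_params):
        p = p_2pl(theta, a, b)
        grad += a * (u - p)            # d logL / d theta for the 2PL
        info += a * a * p * (1.0 - p)  # negative second derivative
    return grad, info

def estimate_theta(responses, a_params, b_params, theta0=0.0, max_iter=50, tol=1e-8):
    """Newton-Raphson ML estimate of theta with its asymptotic standard error."""
    if all(u == 1 for u in responses) or all(u == 0 for u in responses):
        # All-correct or all-incorrect patterns have no finite ML estimate.
        raise ValueError("response pattern must contain both 0s and 1s")
    theta = theta0
    for _ in range(max_iter):
        grad, info = score_and_information(theta, responses, a_params, b_params)
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(theta, responses, a_params, b_params)
    return theta, 1.0 / math.sqrt(info)

# Hypothetical four-item test: equal discriminations, difficulties symmetric about 0.
theta_hat, se = estimate_theta([1, 1, 0, 0], [1.0] * 4, [-1.0, 0.0, 0.0, 1.0], theta0=1.0)
```

Because two of four items are answered correctly and the difficulties are symmetric about zero, the estimate in this example converges to a trait level of zero. Uniform response patterns are rejected because their likelihood has no interior maximum, which is one reason operational programs often prefer Bayesian or marginal approaches.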

Applications

IRT underpins scoring and test development in operational programs run by College Board, ACT, Inc., Medical College Admission Test administrators, and professional certification bodies like National Board of Certification and Recertification of Nurse Anesthetists. It supports patient-reported outcome measurement in projects from National Institutes of Health initiatives and health outcomes units at Johns Hopkins University and Mayo Clinic. In social science research, IRT is applied in large-scale assessments by Programme for International Student Assessment and survey research conducted by Pew Research Center. Industrial and organizational applications appear in competency frameworks used by Society for Industrial and Organizational Psychology affiliated practitioners.

Assumptions and model fit

Key assumptions include unidimensionality, local independence, and monotonicity; tests for these assumptions draw on techniques from latent variable modeling developed at University of Chicago and Northwestern University. Model fit evaluation uses item fit statistics, information functions, and graphical analyses popularized in literature from Psychometric Society conference proceedings and journals associated with the American Educational Research Association. Misfit can also arise from differential item functioning (DIF), in which respondents at the same trait level but in different groups have different probabilities of a given response; such group comparisons, for example across demographic categories like those reported by the U.S. Census Bureau, are examined when fairness issues are raised.
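
The information functions mentioned above can be sketched for the 2PL, where the item information at trait level theta is a^2 P(theta) (1 - P(theta)) and, under local independence, the test information is the sum over items. The function names and the three example items below are hypothetical and chosen only for illustration.

```python
import math

def item_information_2pl(theta, a, b):
    """Fisher information contributed by one 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Test information: sum of item informations, assuming local independence."""
    return sum(item_information_2pl(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Asymptotic standard error of the trait estimate at theta."""
    return 1.0 / math.sqrt(total_information(theta, items))

# Hypothetical (a, b) pairs: one highly discriminating item and two weaker ones.
items = [(2.0, 0.0), (1.0, -1.0), (1.0, 1.0)]
info_at_zero = total_information(0.0, items)
se_at_zero = standard_error(0.0, items)
```

Each 2PL item is most informative where the trait level equals its difficulty, so plotting these functions across the trait range shows where a test measures precisely and where it does not.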

Software and implementation

Common software tools include BILOG-MG, MULTILOG, and IRTPRO, as well as open-source packages for the R and Python environments. Bayesian implementations use platforms like Stan, building on probabilistic programming work from Carnegie Mellon University and Columbia University. Test delivery platforms integrating IRT frameworks are used by ETS and by computerized adaptive testing systems developed in collaborations with University of Massachusetts Amherst researchers.

History and development

Foundational theoretical work began with Georg Rasch in the 1950s, formalized further through efforts at Educational Testing Service and academic programs at University of Chicago, University of Illinois Urbana-Champaign, and Stanford University during the 1960s–1980s. Seminal texts and monographs were authored by figures such as Frederic M. Lord and others presenting logistic formulations and estimation strategies used by testing programs run by College Board and ETS. Subsequent decades saw expansion into computerized adaptive testing pioneered by collaborations between Carnegie Mellon University and University of Minnesota researchers, and growth in health outcomes measurement through initiatives at National Institutes of Health.
