LLMpedia: The first transparent, open encyclopedia generated by LLMs

IRT

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
IRT
Name: IRT
Field: Psychometrics

IRT

Item response theory (IRT) is a family of mathematical models used to analyze responses to assessments, questionnaires, and diagnostic instruments. It links latent traits associated with individuals to the probability of particular item responses, allowing calibration of items, estimation of abilities, and evaluation of measurement precision. IRT underpins modern practices in large-scale testing, computerized adaptive testing, and instrument development across psychometrics, educational measurement, and health outcomes research.

Introduction

IRT traces its foundations to work on latent trait models and probabilistic scaling in the early 20th century; it formalizes how item characteristics interact with examinee traits. Prominent users of IRT include testing organizations such as the Educational Testing Service, which administers the Graduate Record Examinations, and national and international assessment programs such as the Programme for International Student Assessment. By modeling the relationship between test takers and items, IRT supports adaptive test administration in systems built by vendors such as Pearson PLC and by partners of organizations like Khan Academy.

History and Development

The conceptual roots of IRT emerged from early work on Guttman scaling and from formalizations influenced by research at the University of Chicago and by the traditions associated with Frederic M. Lord and Georg Rasch. In the mid-20th century, probabilistic models were refined by research groups at Princeton University, the University of Chicago, and the University of Minnesota, while operational deployment accelerated with large-scale projects at the Educational Testing Service and international assessments coordinated by the Organisation for Economic Co-operation and Development. Advances in computation during the late 20th century, driven by processors from firms such as Intel Corporation and by software efforts at the University of California, Berkeley and Stanford University, enabled widespread use of three-parameter and graded response models in both commercial and research settings.

Theoretical Foundations

Foundational theoretical work draws on contributions from mathematicians and statisticians at institutions such as Harvard University, Columbia University, and the University of Cambridge. The core principle relates a continuous latent trait to item characteristic functions, represented by logistic or normal ogive curves, concepts investigated by researchers including Frederic M. Lord and Fumiko Samejima. Likelihood-based estimation, Bayesian frameworks, and marginal maximum likelihood techniques were expanded through collaborations involving David J. Bartholomew and groups at the University of Chicago. The Rasch model, championed by Georg Rasch, occupies a distinctive philosophical position emphasizing sufficiency and specific objectivity, and has influenced measurement practice at institutions such as the World Health Organization and the National Institutes of Health.
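The item characteristic functions mentioned above are most often logistic. As a sketch, the standard two-parameter logistic (2PL) curve gives the probability that examinee i answers item j correctly as:

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\!\left[-a_j(\theta_i - b_j)\right]}
```

where \(\theta_i\) is the examinee's latent trait, \(b_j\) the item difficulty, and \(a_j\) the item discrimination. The normal ogive variant replaces the logistic function with the standard normal cumulative distribution, \(P = \Phi\!\left(a_j(\theta_i - b_j)\right)\).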

Models and Parameters

IRT encompasses multiple models whose parameters represent item difficulty, discrimination, and guessing. The one-parameter logistic model (1PL) is associated with the Rasch tradition founded by Georg Rasch; the two-parameter (2PL) and three-parameter (3PL) logistic models are applied in testing programs run by organizations such as the Educational Testing Service and the College Board. The graded response model, nominal response model, and partial credit model extend IRT to polytomous items and rating scales, and are used in surveys conducted by the Pew Research Center and in health outcome instruments from the RAND Corporation. Parameter estimation methods include marginal maximum likelihood, expectation-maximization algorithms, and Markov chain Monte Carlo, implemented by research groups at the University of Oxford and the University of Michigan.
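The relationship among the logistic models can be sketched in a few lines of Python. The function below implements the standard 3PL item response function; the parameter values in the example are illustrative, not drawn from any real calibration.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function.

    theta: latent trait level; a: discrimination; b: difficulty;
    c: pseudo-guessing lower asymptote. Returns P(correct | theta).
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# The simpler models are special cases:
#   2PL: fix c = 0;  1PL/Rasch: additionally fix a to a common constant.
theta_grid = np.linspace(-3.0, 3.0, 7)
probs = irf_3pl(theta_grid, a=1.2, b=0.0, c=0.2)  # hypothetical item
```

At theta = b the 3PL curve passes through the midpoint between the guessing floor c and 1, i.e. c + (1 - c)/2, which is one way item difficulty is interpreted on the trait scale.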

Applications

IRT is widely applied in educational testing, certification exams, psychological assessment, and patient-reported outcome measures. High-stakes applications include tests such as the Graduate Record Examinations, the Scholastic Assessment Test, and professional licensure exams administered by bodies such as the National Council of State Boards of Nursing. In the health sciences, IRT informs item banks and computerized adaptive testing for instruments such as those developed by PROMIS at Northwestern University with support from the National Institutes of Health. Cross-cultural assessment and linking studies involve organizations such as the OECD's Programme for International Student Assessment and efforts coordinated by UNICEF and the World Bank.
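Computerized adaptive testing typically selects each next item to maximize Fisher information at the current ability estimate. A minimal sketch of that selection rule for 2PL items, with hypothetical item parameters, might look like this:

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_next_item(theta_hat, items, administered):
    """Return the index of the unadministered item most informative at theta_hat.

    items: list of (a, b) parameter pairs; administered: set of used indices.
    """
    best, best_info = None, -np.inf
    for idx, (a, b) in enumerate(items):
        if idx in administered:
            continue
        info = item_information(theta_hat, a, b)
        if info > best_info:
            best, best_info = idx, info
    return best

# Hypothetical item pool: with equal discrimination, the item whose
# difficulty is closest to the ability estimate carries the most information.
pool = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
next_item = select_next_item(0.1, pool, administered=set())
```

Operational CAT engines add exposure control and content balancing on top of this core rule, but maximum information at the interim ability estimate is the standard starting point.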

Software and Implementation

A range of software supports IRT modeling, estimation, and test assembly. Commercial packages and open-source projects from entities such as IBM, Microsoft, RStudio, and academic groups provide tools: prominent examples include BILOG-MG, used historically at testing companies; IRTPRO, maintained by researchers; and R packages developed by contributors affiliated with the University of Illinois and Carnegie Mellon University. General-purpose statistical environments such as R (programming language) and Python (programming language) host libraries for marginal maximum likelihood, Bayesian estimation, and adaptive testing engines used by research teams at the Massachusetts Institute of Technology and Johns Hopkins University.
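One of the core computations these libraries perform is ability scoring from calibrated items. As a self-contained sketch (not the API of any particular package), an expected a posteriori (EAP) estimate for 2PL items can be computed by numerical quadrature over a standard normal prior; all item parameters below are illustrative:

```python
import numpy as np

def eap_ability(a, b, responses, n_quad=61):
    """Expected a posteriori (EAP) ability estimate for 2PL items.

    a, b: arrays of item discriminations and difficulties;
    responses: 0/1 array of scored answers. Assumes a N(0, 1) prior.
    """
    grid = np.linspace(-4.0, 4.0, n_quad)               # quadrature points
    prior = np.exp(-0.5 * grid**2)                      # normal prior (unnormalized)
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))  # P(correct) per point/item
    like = np.prod(p**responses * (1.0 - p)**(1 - responses), axis=1)
    post = prior * like                                 # posterior up to a constant
    return float(np.sum(grid * post) / np.sum(post))    # posterior mean

# Hypothetical three-item test: two correct answers, one miss.
theta_hat = eap_ability(np.array([1.0, 1.2, 0.8]),
                        np.array([-1.0, 0.0, 1.0]),
                        np.array([1, 1, 0]))
```

EAP scoring is popular in operational systems because, unlike maximum likelihood, it yields a finite estimate even for all-correct or all-incorrect response patterns.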

Criticisms and Limitations

Criticisms of IRT arise from debates over model fit, the assumption of unidimensionality, and fairness across demographic groups. Psychometricians and ethicists affiliated with organizations such as the American Educational Research Association and the National Council on Measurement in Education have highlighted the challenges of detecting differential item functioning and of ensuring validity for the diverse populations represented in studies by the National Center for Education Statistics. Practical constraints include the need for large calibration samples, computational complexity noted by groups at Cornell University, and disputes over philosophical underpinnings, illustrated by debates between proponents of the tradition of Georg Rasch and advocates of classical IRT associated with Frederic M. Lord.
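Screening for differential item functioning is commonly done with the Mantel-Haenszel procedure, which pools 2x2 tables of correct/incorrect counts for a reference and a focal group across matched score strata. A minimal sketch of the common odds ratio (values near 1 suggest no uniform DIF):

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across score strata.

    tables: list of 2x2 count tables, each of the form
        [[ref_correct, ref_incorrect],
         [focal_correct, focal_incorrect]].
    """
    num = den = 0.0
    for (rc, ri), (fc, fi) in tables:
        n = rc + ri + fc + fi
        if n == 0:
            continue  # skip empty strata
        num += rc * fi / n
        den += ri * fc / n
    return num / den

# Hypothetical two-stratum example with identical group performance:
# the odds ratio is 1, i.e. no evidence of uniform DIF.
odds = mantel_haenszel_odds_ratio([[[30, 10], [30, 10]],
                                   [[20, 20], [20, 20]]])
```

In practice the statistic is usually reported on the ETS delta scale with a significance test, and nonuniform DIF requires other methods (e.g. logistic regression with an interaction term), but this pooled odds ratio is the core quantity.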

Category:Psychometrics