- Abstract
- Introduction
- Background: Item Response Theory
- Method
- Conferences searched
- Scrapers implemented
- LLM and human coding of returned abstracts
- Results
- Papers identified
- Methodological audit findings
- Standard versus custom models
- Latent trait definitions and response formats
- Estimation and convergence
- Sample (persons, items) adequacy
- Calibration validation split and model comparisons
- Dimensionality and local independence assumptions
- Convergence and parameter reporting
- Statistical and graphical model fit
- Discussion
- Proposed Checklist for IRT in NLP and AI Research
- Limitations
- AI disclosure
- References
Abstract
Item Response Theory (IRT) is growing in prominence within machine learning, yet no systematic review has examined whether these applications align with established psychometric practice. We identified 58 IRT papers across nine major ML conference venues since 2017, the year transformers emerged. We coded these papers using an LLM-assisted schema to study application context and methodological rigour. Based on these findings, we propose a reporting checklist of ten requirements that bridges psychometric convention and ML practice, so that the results of IRT analyses in high-stakes AI contexts can be trusted. This review is in progress and this sentence will be deleted when the section is complete.
Introduction
Item Response Theory (IRT) is a well-established psychometric measurement technique that has useful applications in machine learning (ML), particularly for model evaluation. Its use is becoming more pronounced as model evaluation recommendations move towards construct-driven LLM evaluations (Zhou et al., 2026), which recognize that model ability and item characteristics must be modelled with formal psychometric frameworks.
Despite this convergence of psychometrics and ML and the growing recognition of the need for psychometrics, IRT has only recently started to appear in the machine learning literature. To date there is no clear picture of what purposes IRT is being used for in machine learning and NLP or whether overall application practices are aligned with lessons learned from psychological and educational measurement.
Understanding how IRT is being used in machine learning and NLP, and whether this use aligns with best practices in psychometric measurement, is important because measuring LLM performance is an increasingly high-stakes endeavour, with benchmark results influencing which models are operationally deployed. Incorrect applications of IRT in this context can lead to incorrect conclusions about LLM capabilities and typical behaviors.
In response, this section systematically explores IRT applications in machine learning contexts in the transformer era. We examine both typical applications of IRT and IRT reporting practices in machine learning. We use 2017, the year Vaswani et al. (2017) appeared, as the starting year for our searches because transformers changed the way ML models function to be more consistent with the way humans function, increasing the relevance of IRT in ML.
In the ML and NLP context, the respondent is frequently a model rather than a human, and the latent trait of interest is model ability rather than a psychological construct. The formal structure of IRT nonetheless applies: items vary in difficulty and discrimination, models vary in ability, and the probability of a correct response is a function of both. This makes IRT an appealing framework for LLM evaluation problems.
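The relationship described above can be sketched in a few lines. This is a minimal illustration that assumes the standard three-parameter logistic (3PL) form; the function name `p_correct` and the parameter values are illustrative and are not taken from any of the reviewed papers.

```python
import math


def p_correct(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model.

    theta: ability of the respondent (here, a model)
    a: item discrimination
    b: item difficulty
    c: lower asymptote (guessing); c = 0 gives the 2PL form
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# With no guessing, a respondent whose ability equals the item difficulty
# answers correctly with probability 0.5.
print(p_correct(theta=0.0, a=1.5, b=0.0))
```

Holding the item fixed, the probability rises with ability; holding ability fixed, it falls as difficulty increases. This is the sense in which the response probability is a function of both.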
This section makes three contributions. First, we provide a systematic mapping of how IRT is applied in ML and NLP in the transformer era. Second, we evaluate these applications against established criteria, assessing how assumptions and reporting practices are addressed. Third, we propose a minimal set of reporting and modeling standards for the use of IRT in ML contexts to improve the validity and interpretability of future IRT work in ML.
Background: Item Response Theory
IRT is a family of psychometric models that estimate the probability of a specific response to a test item as a function of the properties of the item and the ability of the respondent (Baker, 2001; Embretson & Reise, 2000; de Ayala, 2009; Lord & Novick, 1968). IRT models item and person parameters as invariant across samples, unlike classical test theory where item difficulty is population-dependent.
IRT models rest on two core assumptions. Unidimensionality requires that a single latent trait accounts for the pattern of responses across items. Local independence requires that item responses are statistically independent after conditioning on that latent trait. Violations of these assumptions can bias parameter estimates and undermine the validity of any conclusions drawn from the model (Foster et al., 2017).
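One common diagnostic for local independence is Yen's Q3, the correlation between the residuals of each item pair after the model has been fitted. The sketch below assumes a 2PL model whose abilities and item parameters have already been estimated. The function names and data layout are hypothetical, and the source does not state which diagnostics the reviewed papers used.

```python
import math
from itertools import combinations


def p_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def pearson(x: list, y: list) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses: list, thetas: list, items: list) -> dict:
    """Return {(i, j): Q3} for every item pair.

    responses[p][i] is the 0/1 response of respondent p to item i,
    thetas[p] is the estimated ability, items[i] is an (a, b) tuple.
    """
    residuals = [
        [responses[p][i] - p_2pl(thetas[p], *items[i]) for p in range(len(thetas))]
        for i in range(len(items))
    ]
    return {
        (i, j): pearson(residuals[i], residuals[j])
        for i, j in combinations(range(len(items)), 2)
    }
```

Item pairs with large absolute Q3 values share variance that the latent trait does not explain, which is the pattern a local independence violation produces.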
Method
Conferences searched
We conducted a systematic search of major ML conference families to identify papers applying IRT since Vaswani et al. (2017) introduced the transformer architecture. We wrote scrapers to query NeurIPS via its official proceedings site, ICML via the PMLR proceedings site, ICLR via Semantic Scholar, AAAI via metadata-linked DOI and proceedings pages, and ACL-family venues (ACL, EMNLP, NAACL, EACL) via the ACL Anthology website.
Scrapers implemented
The search and screening procedures were structured to align with key elements of PRISMA-style systematic reviews, covering publications from 2017 to April 2026. Main conference and journal papers were included; workshop papers were excluded. Two independent scrapers were developed using different keyword matching strategies: the first used substring matching and the second used word-boundary matching.
Each scraper searched paper titles and abstracts using a two-tier keyword system. Tier 1 keywords included direct IRT terminology such as Item Response Theory, Rasch, 2PL, 3PL and graded response model. Tier 2 used related phrases such as ability estimation. Results were exported to CSV containing year, conference, matched keyword, authors, title, abstract and URL. A script was written to obtain these articles for analysis.
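The difference between the two matching strategies can be illustrated with a short sketch. The function names are hypothetical and the scrapers' actual code is not shown in this section; the point is only that substring matching accepts a keyword embedded in a longer word, whereas word-boundary matching does not.

```python
import re


def substring_match(text: str, keyword: str) -> bool:
    """True if the keyword occurs anywhere in the text, even inside another word."""
    return keyword.lower() in text.lower()


def word_boundary_match(text: str, keyword: str) -> bool:
    """True only if the keyword occurs as a whole word or phrase."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


title = "A virtual assistant benchmark"
print(substring_match(title, "IRT"))      # True: "virtual" contains "irt"
print(word_boundary_match(title, "IRT"))  # False: no standalone "IRT"
```

Substring matching favours recall at the cost of false positives that must be screened out later; word-boundary matching favours precision.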
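A two-tier search with CSV export along these lines might look like the following sketch. The keyword lists contain only the examples named above, the helper names are hypothetical, and word-boundary matching is assumed for both tiers.

```python
import csv
import io
import re

# Only the example keywords named in the text; the full lists are not given here.
TIER1 = ["item response theory", "Rasch", "2PL", "3PL", "graded response model"]
TIER2 = ["ability estimation"]
FIELDS = ["year", "conference", "matched_keyword", "authors", "title", "abstract", "url"]


def first_match(text: str, keywords: list):
    """Return the first keyword found as a whole word or phrase, else None."""
    for kw in keywords:
        if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE):
            return kw
    return None


def tag_paper(paper: dict):
    """Search title and abstract, trying Tier 1 before Tier 2."""
    text = f"{paper['title']} {paper['abstract']}"
    for tier, keywords in ((1, TIER1), (2, TIER2)):
        kw = first_match(text, keywords)
        if kw is not None:
            return {**paper, "matched_keyword": kw, "tier": tier}
    return None


def to_csv(rows: list) -> str:
    """Serialise tagged papers to CSV text with the columns listed above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Checking Tier 1 first means a paper that contains both direct and related terms is recorded under the more specific keyword.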
LLM and human coding of returned abstracts
Our framework included a methodological layer capturing how rigorously IRT models were applied and reported, and an application layer capturing how IRT was used in each paper. Claude code was used to extract whether: custom or standard models were applied, the latent trait was defined, dimensionality and local independence were checked, model fit statistics and visual fit plots were reported, multiple models were compared, item/person parameter estimates were reported, and sample size (persons, items) was stated.
We used OpenAI’s GPT‑4o for coding. Coded outputs were then audited in an independent verification pass by a second LLM, Claude Sonnet 4.6, which checked whether each coding was supported without overwriting it. Coding required explicit evidence where available. Descriptive fields were coded based on the paper’s context when explicit statements were absent. Methodological audit fields required verbatim evidence. Results were written to JSON and CSV for further analysis.
Scraper 1 returned 335 abstracts and scraper 2 returned 43. Human review eliminated abstracts where IRT was part of other words (e.g., birth, dirt*, thirteen, virtual*) or where key phrases matched in papers not involving IRT (e.g., ability in probability), yielding 100 abstracts. Human review deduplicated the file and Naganuma et al. (2026) was added manually for a final set of 62. PDFs were downloaded with a script (20 files) and manually (42 files). Manual screening revealed 3 false positives giving 58 articles. Full text was extracted using pdfplumber for further analysis.
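The verbatim-evidence requirement lends itself to a simple mechanical pre-check, sketched below. This is not the LLM verification pass described above: it is a plain string comparison, and the function names and record layout are hypothetical. It flags whether a quoted evidence span actually occurs in the extracted text and leaves the coded value untouched.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verbatim_supported(evidence: str, full_text: str) -> bool:
    """True if a non-empty evidence quote occurs in the paper text."""
    quote = normalize(evidence)
    return bool(quote) and quote in normalize(full_text)


def audit(coding: dict, full_text: str) -> dict:
    """Label each coded field without modifying the coding itself."""
    return {
        field: "supported"
        if verbatim_supported(entry.get("evidence", ""), full_text)
        else "unsupported"
        for field, entry in coding.items()
    }
```

A string check of this kind can only separate quotes that appear in the text from quotes that do not; deciding whether a passage that does appear really supports the coded value still requires a human or LLM judgement.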
Results
Coding reliability was assessed via an independent verification pass by a second LLM, which judged 55% of codings as supported by verbatim evidence, 35% as ambiguous, and 6% as unsupported. We manually checked the ambiguous and unsupported codings in the results that follow.
Papers identified
Across nine major computer science conference venues from 2017 to April 2026, we identified 58 papers using IRT. The ‘All venues’ row rises from very low counts in the early period to a peak of 15 in 2025; 2026 already stands at 12, although the calendar year is not yet complete. IRT is moving from a niche method to a more commonly accepted technique but is still not ubiquitous or standard. Most papers appear after 2022, likely when transformer influence emerged more widely and IRT was used in model evaluations. AAAI emerged as the dominant outlet, accounting for 28 papers across five years.
Methodological audit findings
Standard versus custom models
Papers were coded using GPT-4o, which extracted the IRT model type from each paper chunk. The model was instructed to identify whether the paper used a 1PL/Rasch, 2PL, 3PL, GRM, GPCM, MIRT, CDM or custom IRT model, based only on verbatim evidence from the text. Conservative coding rules were applied. 'Unclear' was a last resort, and any model modification was coded as 'custom'. Across all 58 papers, the most common model type was custom, reflecting the preference for novel or extended IRT-based approaches in the ML/NLP literature. Standard models were also well represented.
Latent trait definitions and response formats
Estimation and convergence
Sample (persons, items) adequacy
Calibration validation split and model comparisons
Dimensionality and local independence assumptions
Convergence and parameter reporting
Statistical and graphical model fit
Discussion
Key Findings
Implications for Machine Learners Using IRT in Practice
Proposed Checklist for IRT in NLP and AI Research
Limitations
The search was restricted to titles and abstracts, meaning papers where IRT appears only in the body text may have been missed. Coverage of ICLR relied on Semantic Scholar rather than direct proceedings access, which may have introduced gaps in that venue. NeurIPS 2025 proceedings were not yet published at the time of search and are therefore excluded. Workshop papers were excluded throughout, though some relevant IRT work has appeared in workshop venues.
AI disclosure
The author used Claude (Anthropic) and OpenAI to assist with code development, data pipeline construction, and drafting text.
References
Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment and Evaluation.
de Ayala, R. J. (2009). The theory and practice of item response theory. Guilford Press.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
Foster, G. C., Min, H., & Zickar, M. J. (2017). Review of item response theory practices in organizational research: Lessons learned and paths forward. Organizational Research Methods, 20(3), 465-486.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
Zhou, L., Pacchiardi, L., Martínez-Plumed, F., Collins, K. M., Moros-Daval, Y., Zhang, S., Zhao, Q., Huang, Y., Sun, L., Prunty, J. E., Li, Z., Sánchez-García, P., Jiang-Chen, K., Casares, P. A. M., Zu, J., Burden, J., Mehrbakhsh, B., Stillwell, D., Cebrian, M., Wang, J., Henderson, P., Wu, S. T., Kyllonen, P. C., Cheke, L., Xie, X., & Hernández-Orallo, J. (2026). General scales unlock AI evaluation with explanatory and predictive power. Nature, 652, 58–65. https://doi.org/10.1038/s41586-026-10303-2
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).