- Additional step to valid person scores
- Challenges and opportunities this creates
- Possibility of factor analyzing free text
Additional step to valid person scores
In conventional psychometrics, item parameters such as difficulty and discrimination are unsupervised, model-dependent internal quantities estimated from item response patterns. These parameters, in turn, enable accurate estimation of person scores. In AI psychometrics, item parameters are instead either predicted from embeddings of item language or estimated from data substituted for empirical responses (e.g., artificial crowds, cosine similarity matrices of item embeddings). The result is pseudo parameters that require external item parameters as ground truth.
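As a concrete illustration, a pseudo inter-item correlation matrix can be built from cosine similarities of item embeddings and then factor analyzed. The sketch below is one possible pipeline, not a prescribed one: it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, and the item texts and one-factor principal-axis extraction are illustrative.

```python
# Minimal sketch: pseudo item parameters from item-text embeddings.
# Assumes sentence-transformers and the all-MiniLM-L6-v2 model (any
# fixed-width text-embedding model would serve the same role).
import numpy as np
from sentence_transformers import SentenceTransformer

items = [
    "I enjoy meeting new people.",
    "I start conversations with strangers.",
    "I prefer to keep to myself.",
    "I feel energised in large groups.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(items, normalize_embeddings=True)  # unit-length rows

# Cosine similarity matrix, treated as a pseudo inter-item correlation matrix.
pseudo_r = emb @ emb.T

# One-factor principal-axis style extraction: loadings from the leading eigenpair.
eigvals, eigvecs = np.linalg.eigh(pseudo_r)  # eigenvalues in ascending order
pseudo_loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
print(pseudo_loadings)  # still requires empirical loadings as ground truth
```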
The net effect is that AI psychometrics leaves us one step further removed than conventional psychometrics from our ultimate goal of reliable and valid person scores, creating an interim validation need. Even use cases that do not centre on approximating quantitative values still frequently require a quantitative quality metric, and that metric will itself be a step removed from its conventional psychometric counterpart. Whether interest lies in measurement modelling or in structural associations between constructs, this extra degree of separation poses new validation challenges for the measurement field to grapple with.
Challenges and opportunities this creates
The first challenge is that we cannot simply substitute AI-derived parameters for real item parameters without first checking their consistency against empirical parameters derived from real human responses. The degree to which this extra inference step introduces consequential error is an open question that must be checked rather than assumed.
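A minimal version of that check might compare AI-derived difficulties with empirically calibrated ones on simple agreement metrics. The values below are placeholders; real inputs would come from an AI pipeline on one side and an IRT calibration of human responses on the other.

```python
# Minimal sketch of the consistency check: pseudo vs. empirical item
# difficulties. The arrays are placeholders, not real calibrations.
import numpy as np

pseudo_difficulty = np.array([-1.2, -0.3, 0.4, 1.1, 1.8])     # AI-derived
empirical_difficulty = np.array([-1.0, -0.5, 0.6, 0.9, 2.1])  # human-calibrated

r = np.corrcoef(pseudo_difficulty, empirical_difficulty)[0, 1]
rmse = np.sqrt(np.mean((pseudo_difficulty - empirical_difficulty) ** 2))
print(f"Pearson r = {r:.2f}, RMSE = {rmse:.2f}")
# High r with low RMSE supports substitution; high r with high RMSE signals
# agreement in rank order but a calibration (shift or scale) bias.
```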
The second challenge is determining how similar pseudo and actual parameters need to be for applied utility. Guenole et al. (2025) showed that pseudo loadings correlated strongly with empirical loadings, but the two sets of values were not invariant with respect to one another: strong correlation did not imply equality. Nonetheless, the strong correspondence offers a potential solution to the 'cold start' problem, where pre-information is needed before initial live use.
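One way to quantify 'similar enough' for loadings is Tucker's congruence coefficient, a standard index of factor similarity, contrasted below with plain correlation. The loading values are placeholders, and the .95 reading is a convention rather than a guarantee of applied utility.

```python
# Minimal sketch: Tucker's congruence coefficient for loading similarity.
# Unlike Pearson r, phi is scale-free but not shift-free, so a uniform
# offset between pseudo and empirical loadings lowers phi but not r.
import numpy as np

def tucker_congruence(a: np.ndarray, b: np.ndarray) -> float:
    """phi = sum(a*b) / sqrt(sum(a^2) * sum(b^2))."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

pseudo = np.array([0.71, 0.65, 0.58, 0.62])     # placeholder pseudo loadings
empirical = np.array([0.55, 0.49, 0.44, 0.47])  # placeholder empirical loadings
print(f"r = {np.corrcoef(pseudo, empirical)[0, 1]:.2f}, "
      f"phi = {tucker_congruence(pseudo, empirical):.2f}")
# Values of phi near .95 or above are conventionally read as equivalence.
```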
Possibility of factor analyzing free text
An exception to this one-step-removal situation occurs when scoring constructed responses and other free-text data sources. Here we are at the same proximity to reliable and valid person scores as in conventional psychometrics, albeit with free text to score. Ground truth in this case is usually either predictions of human scores from embeddings or zero- or few-shot classification. Conventional factor scoring is less feasible because free-text responses have different dimensional representations across candidates.
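The supervised route, predicting human scores from response embeddings, might look like the sketch below. It assumes scikit-learn and uses random vectors as stand-ins for embeddings; real inputs would be response embeddings paired with human ratings.

```python
# Minimal sketch: predicting human ratings from response embeddings with
# ridge regression. Embeddings and scores are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))          # one 384-dim embedding per response
y = 2 * X[:, 0] + rng.normal(size=200)   # stand-in for human scores

r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"Cross-validated R^2: {r2.mean():.2f}")  # agreement with human raters
```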
Embeddings may resolve this by standardizing variable-length passages into fixed-dimensional vectors, allowing factor analysis of text passages of different lengths. This reintroduces the possibility of measurement based on unsupervised internal structure. Machine learning, in contrast, has focused on predicting external ground truth, a notable difference given that many psychometric breakthroughs rested on internal structural analyses.
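The point is easy to demonstrate: passages of different lengths encode to vectors of identical width, which stack into a candidates-by-dimensions matrix suitable for unsupervised structural analysis. The sketch below again assumes sentence-transformers, uses invented responses, and substitutes PCA for a full factor model.

```python
# Minimal sketch: variable-length responses map to fixed-length vectors,
# so they stack into a matrix that supports internal-structure analysis.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

responses = [
    "I would apologise and fix the error.",
    "First I would gather all the facts, speak with everyone involved, "
    "and only then decide how to put things right.",
    "Mistakes happen; I would move on quickly.",
    "I would escalate to my manager immediately.",
    "I would own the mistake publicly and share what I learned.",
    "I would quietly correct it and tell no one.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
X = model.encode(responses)                    # shape (6, 384) despite unequal lengths
scores = PCA(n_components=2).fit_transform(X)  # unsupervised structure, no labels
print(X.shape, scores.shape)
```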
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).