Reasons for AI psychometric progress

The power of transformers

AI psychometrics is making rapid progress because of transformer models. Transformers are pre-trained neural network models that are extremely efficient at modeling sequence data, such as dependencies between language tokens. Natural Language Processing (NLP) transformers include encoder models that create embeddings for natural language understanding (NLU), decoder models used for generative AI text generation tasks, and encoder-decoder models used for sequence-to-sequence tasks such as translation. Hussain, Wulf, and Mata (2025) provide a useful introduction to transformers for behavioural science.
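
As an illustration, an encoder model can map each scale item to an embedding vector in a few lines. This is a minimal sketch assuming the sentence-transformers package and the publicly available all-MiniLM-L6-v2 checkpoint; the items are illustrative wording, not a validated scale.

```python
from sentence_transformers import SentenceTransformer

# Illustrative scale items.
items = [
    "I am the life of the party.",
    "I feel comfortable around people.",
    "I keep in the background.",
]

# Any encoder checkpoint could be used; "all-MiniLM-L6-v2" is a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

# One embedding vector per item; shape is (n_items, embedding_dim).
embeddings = model.encode(items)
print(embeddings.shape)
```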

Embeddings as psychometric data

Researchers have shown that the models conventionally estimated on empirical item response data (e.g. latent variable models such as IRT and factor analysis, as well as network models) can also be applied to item embeddings. The embeddings are used to form a matrix of item associations, which is then analysed with conventional methods. Interestingly, conventional psychometric approaches applied to embeddings yield measurement model parameters that are strongly related to the parameters estimated on empirical data. Demonstrations so far have focused on the practical utility of these parameter correspondences.
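
A minimal sketch of the idea, using NumPy only: item embeddings are converted to an item-by-item cosine similarity matrix, whose eigenstructure can then be inspected in the same spirit as a factor decomposition of an empirical correlation matrix. The random embeddings below are a stand-in for encoder output such as that produced in the previous sketch.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between item embedding vectors."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

# Stand-in for encoder output: an (n_items, embedding_dim) array.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 384))

similarity = cosine_similarity_matrix(embeddings)

# Eigendecomposition of the association matrix plays the role that a
# principal-axis or factor decomposition plays for an empirical item
# correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(similarity)
print(np.round(eigenvalues[::-1], 3))  # largest components first
```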

Substitutability assumption

Guenole et al. (2025) proposed a heuristic explanation, the substitutability assumption, as a first step toward understanding why AI psychometrics works, a process about which we currently know very little. The substitutability assumption suggests that an item's embedding vector can substitute for an empirical vector of human responses to that item under certain yet-to-be-specified conditions.
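
One way the assumption could be probed is sketched below with placeholder data: compare the off-diagonal entries of an embedding-derived cosine similarity matrix with the off-diagonal entries of the empirical item correlation matrix for the same items. All arrays here are illustrative stand-ins, not real responses or real embeddings.

```python
import numpy as np

def offdiag(matrix: np.ndarray) -> np.ndarray:
    """Return the upper-triangular entries, excluding the diagonal."""
    return matrix[np.triu_indices_from(matrix, k=1)]

rng = np.random.default_rng(1)

# Placeholder empirical pathway: 200 simulated respondents on 5 items.
responses = rng.normal(size=(200, 5))
empirical_corr = np.corrcoef(responses, rowvar=False)

# Placeholder embedding pathway: fake embedding vectors for the same 5 items.
fake_embeddings = rng.normal(size=(5, 384))
unit = fake_embeddings / np.linalg.norm(fake_embeddings, axis=1, keepdims=True)
embedding_sim = unit @ unit.T

# Agreement between the two association structures: higher values would be
# consistent with substitutability for these items.
agreement = np.corrcoef(offdiag(empirical_corr), offdiag(embedding_sim))[0, 1]
print(f"Off-diagonal agreement: {agreement:.2f}")
```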

Human and machine ratings have notable parallels

Early discussion noted parallels between the processes that generate data from AI vector representations and the processes that generate empirical item responses. Person responses to scale items are human encodings of the descriptiveness of those items. These are aggregated to form a sample matrix of item response covariances or correlations for the test-taking population.

A pre-trained encoder model’s item representations are also encodings, and they too are used to form an association matrix. If test takers have read the content scraped for training, it may directly influence their ratings. If scraped content reflects cultural norms, it may indirectly reflect human test takers’ knowledge and world views. Web content can therefore plausibly influence both human and machine encodings.

Here we consider further parallels between human ratings of scale statements and LLM embeddings of those same statements, both prior to and following the encoding stage, as these parallels may explain the parameter correspondences across the two methods. Table 1 summarizes the common stages.

Table 1. Common stages in human ratings and LLM encodings

| Process Stage | Human Ratings | LLM Encodings |
| --- | --- | --- |
| Content authoring | Test designers write items or statements to measure constructs | Humans post text content on the Internet |
| Pre-processing | Items are reviewed by experts for content and relevance | Text is cleaned and tokenized |
| Task framing | Test instructions written by developers shape test-taker perceptions | Pre-training objectives chosen by model designers shape model behaviour |
| World view source | Humans draw on self-knowledge from personal learning and cultural experience | Models acquire knowledge from shared cultural data via web scraping |
| Stimulus interpretation | Humans activate self-knowledge in response to test items and encode self-descriptiveness as ratings | Models represent items using distributional knowledge of language encoded in model weights |
| Matrix representation | Individual responses are collected and turned into a correlation matrix | Embeddings are collated and turned into a cosine similarity matrix |
| Matrix interpretation | Shared human self-knowledge is reflected in the correlational patterns | Shared corpus language usage is reflected in the cosine similarity patterns |
| Analysis | Psychometric analysis via latent variable models (e.g. Factor Analysis, IRT) or network models | Psychometric analysis via latent variable models (e.g. Factor Analysis, IRT) or network models |
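
The "Matrix representation" and "Analysis" rows of Table 1 can be sketched in code: the same exploratory factor routine is applied to the human correlation matrix and to the embedding similarity matrix. This sketch assumes the factor_analyzer package and its is_corr_matrix option; the matrices below are placeholders rather than real ratings or real embeddings.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer

def loadings_from_association_matrix(association: np.ndarray, n_factors: int = 1) -> np.ndarray:
    """Fit an exploratory factor model directly to an item association matrix."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation=None, is_corr_matrix=True)
    fa.fit(association)
    return fa.loadings_

rng = np.random.default_rng(2)

# Placeholder human pathway: simulated responses with a shared factor -> correlation matrix.
responses = rng.normal(size=(200, 4)) + rng.normal(size=(200, 1))
human_corr = np.corrcoef(responses, rowvar=False)

# Placeholder LLM pathway: fake embeddings with shared structure -> cosine similarity matrix.
fake_embeddings = rng.normal(size=(4, 384)) + rng.normal(size=(1, 384))
unit = fake_embeddings / np.linalg.norm(fake_embeddings, axis=1, keepdims=True)
embedding_sim = unit @ unit.T

# The same analysis stage is applied to both association matrices.
print(loadings_from_association_matrix(human_corr))
print(loadings_from_association_matrix(embedding_sim))
```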

Much to learn but the future is bright

The substitutability assumption relies heavily on analogy rather than scientific evidence. It is unclear what properties of embeddings allow them to capture trait-related variance, what the effects of training data composition are, and whether fine-tuning can bring training data more closely into line with assessment populations. The field does not yet understand these processes well. Despite not knowing the precise mechanisms through which the correspondences between embedding-based and empirical psychometrics arise, the parallels offer powerful psychometric capabilities.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).