- The power of transformers
- Embeddings as psychometric data
- Substitutability assumption
- Human and machine ratings have notable parallels
- Much to learn but the future is bright
The power of transformers
AI psychometrics is making fast progress because of transformer models. Transformers are pre-trained neural network models that are extremely efficient at modeling sequence data, such as dependencies between tokens in language. Natural Language Processing (NLP) transformers include encoder models that create embeddings for natural language understanding (NLU), decoder models used for generative AI text generation tasks, and encoder-decoder models used for sequence-to-sequence tasks such as translation. Hussain, Wulf, and Mata (2025) provide a useful introduction to transformers for behavioural science.
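As a minimal illustration of the encoder side of this picture, item embeddings can be obtained from a pre-trained encoder model in a few lines. The sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint; the example statements are hypothetical and are not drawn from any particular inventory.

```python
# Minimal sketch: obtaining item embeddings from a pre-trained encoder model.
# Assumes the sentence-transformers library; the items below are hypothetical.
from sentence_transformers import SentenceTransformer

items = [
    "I am the life of the party.",
    "I feel comfortable around people.",
    "I start conversations.",
]

# Load a small pre-trained encoder and embed each statement as a dense vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(items)  # shape: (n_items, embedding_dim)
print(embeddings.shape)
```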
Embeddings as psychometric data
Researchers have shown that the kinds of models typically estimated on empirical item response data (e.g., latent variable models such as IRT and factor analysis, as well as network models) can also be applied to item embeddings. The embeddings are used to form a matrix of embedding associations that is analysed with these conventional methods. Interestingly, conventional psychometric approaches applied to embeddings yield measurement model parameters that are strongly related to parameters estimated on empirical data. Demonstrations so far have focused on showing the practical utility of these parameter correspondences.
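As a rough sketch of that workflow, the snippet below builds a cosine similarity matrix from item embeddings and extracts loadings via an eigendecomposition. This is a principal-components-style stand-in for a fuller factor, IRT, or network model, not any particular published analysis; the random `embeddings` array is a placeholder for output from an encoder model such as the one sketched above.

```python
# Sketch: analysing the item embedding association matrix with a conventional
# decomposition. A principal-components-style eigendecomposition stands in here
# for a fuller factor, IRT, or network model.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# `embeddings` should be an (n_items, embedding_dim) array, e.g. the output of
# the encoder sketch above; a random placeholder is used here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 384))

sim = cosine_similarity(embeddings)  # (n_items, n_items) association matrix

# Eigendecomposition of the association matrix, sorted by variance explained.
eigvals, eigvecs = np.linalg.eigh(sim)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_factors = 1
loadings = eigvecs[:, :n_factors] * np.sqrt(np.maximum(eigvals[:n_factors], 0))
print(loadings)  # item "loadings" on the first component
```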
Substitutability assumption
Guenole et al. (2025) proposed a heuristic explanation, the substitutability assumption, as a first step toward understanding why AI psychometrics works, something we currently know very little about. The substitutability assumption suggests that an item's embedding vector can substitute for an empirical vector of human item responses under certain yet-to-be-specified conditions.
Human and machine ratings have notable parallels
Early discussion noted parallels between the processes that generate AI vector representations and those that generate empirical item responses. Person responses to scale items are human encodings of how descriptive those items are. These responses are aggregated to form a sample matrix of item response covariances or correlations for the test-taking population.
A pre-trained encoder model’s item representations are likewise encodings, and they are used to form an association matrix. If test takers have read the content that was scraped for training, that content may directly influence their ratings. If scraped content reflects cultural norms, it may indirectly reflect test takers’ knowledge and world views. Web content can therefore plausibly influence both human and machine encodings.
Here we consider further parallels between human ratings of scale statements and LLM embeddings of those same statements, both prior to and following the encoding stage, as these parallels may help explain the parameter correspondences between the two methods.
Table 1. Common stages in human ratings and LLM encodings
| Process Stage | Human Ratings | LLM Encodings |
| --- | --- | --- |
| Content authoring | Test designers write items or statements to measure constructs | Humans post text content on the Internet |
| Pre-processing | Items reviewed by experts for content and relevance | Text is cleaned and tokenized |
| Task framing | Test instructions written by developers shape test takers' perceptions | Pre-training objectives set by model designers shape model representations |
| World view source | Humans draw on self-knowledge from personal learning and cultural experience | Models acquire knowledge from shared cultural data via web scraping |
| Stimulus interpretation | Humans activate self-knowledge in response to test items and encode self-descriptiveness as ratings | Models encode items using distributional knowledge of language stored in the model weights |
| Matrix representation | Individual responses are collected and turned into a correlation matrix | Embeddings are collated and turned into a cosine similarity matrix |
| Matrix interpretation | Shared human self-knowledge is reflected in the correlational patterns | Shared corpus language usage is reflected in the cosine similarity patterns |
| Analysis | Psychometric analysis via latent variable models (e.g. factor analysis, IRT) or network models | Psychometric analysis via latent variable models (e.g. factor analysis, IRT) or network models |
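To make the final rows of Table 1 concrete, a minimal check of correspondence might correlate the unique off-diagonal entries of the two association matrices. The sketch below uses a hypothetical `responses` array of human ratings (persons by items) and a placeholder `embeddings` array; both are illustrative stand-ins, not any particular dataset or published analysis.

```python
# Sketch: comparing the human-derived and embedding-derived association matrices.
# `responses` (n_persons, n_items) and `embeddings` (n_items, embedding_dim) are
# hypothetical placeholders for real data.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 5))  # placeholder Likert-style ratings
embeddings = rng.normal(size=(5, 384))         # placeholder item embeddings

human_corr = np.corrcoef(responses, rowvar=False)  # inter-item correlation matrix
machine_sim = cosine_similarity(embeddings)        # inter-item cosine similarity matrix

# Correlate the unique off-diagonal entries as a crude index of correspondence.
iu = np.triu_indices_from(human_corr, k=1)
correspondence = np.corrcoef(human_corr[iu], machine_sim[iu])[0, 1]
print(f"Off-diagonal correspondence: {correspondence:.2f}")
```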
Much to learn but the future is bright
The substitutability assumption relies heavily on analogy rather than direct scientific evidence. It is unclear what properties of embeddings let them capture trait-related variance, what the effects of training data composition are, and whether fine-tuning can bring training data closer into line with assessment populations. The field does not yet understand these processes well, but even without knowing the precise mechanisms through which the correspondences between embedding-based and empirical psychometrics arise, the parallels offer powerful psychometric capabilities.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).