- The power of transformers
- Embeddings as psychometric data
- Substitutability assumption
- Human and machine ratings have notable parallels
- Much to learn but the future is bright
The power of transformers
AI psychometrics is making fast progress because of transformer models. Transformers are pre-trained neural network models that are extremely efficient at modeling sequence data, such as dependencies between tokens in language. Natural Language Processing (NLP) transformers include encoder models that create embeddings for natural language understanding (NLU), decoder models used for generative AI text generation tasks, and encoder-decoder models used for sequence-to-sequence tasks such as translation. Hussain, Wulf, and Mata (2025) provide a useful introduction to transformers for behavioural science.
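As a minimal illustration of the encoder side of this picture, item embeddings can be obtained from a pre-trained encoder model in a few lines. The sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint; the example statements are hypothetical and are not drawn from any particular inventory.

```python
# Minimal sketch: obtaining item embeddings from a pre-trained encoder model.
# Assumes the sentence-transformers library; the items below are hypothetical.
from sentence_transformers import SentenceTransformer

items = [
    "I am the life of the party.",
    "I feel comfortable around people.",
    "I start conversations.",
]

# Load a small pre-trained encoder and embed each statement as a dense vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(items)  # shape: (n_items, embedding_dim)
print(embeddings.shape)
```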
Embeddings as psychometric data
Researchers have shown that the kinds of models typically estimated on empirical item response data (e.g., latent variable models such as IRT and factor analysis, as well as network models) can also be applied to item embeddings. The embeddings are used to form a matrix of embedding associations that is analysed with these conventional methods. Interestingly, conventional psychometric approaches applied to embeddings yield measurement model parameters that are strongly related to parameters estimated on empirical data. Demonstrations so far have focused on showing the practical utility of these parameter correspondences.
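As a rough sketch of that workflow, the snippet below builds a cosine similarity matrix from item embeddings and extracts loadings via an eigendecomposition. This is a principal-components-style stand-in for a fuller factor, IRT, or network model, not any particular published analysis; the random `embeddings` array is a placeholder for output from an encoder model such as the one sketched above.

```python
# Sketch: analysing the item embedding association matrix with a conventional
# decomposition. A principal-components-style eigendecomposition stands in here
# for a fuller factor, IRT, or network model.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# `embeddings` should be an (n_items, embedding_dim) array, e.g. the output of
# the encoder sketch above; a random placeholder is used here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 384))

sim = cosine_similarity(embeddings)  # (n_items, n_items) association matrix

# Eigendecomposition of the association matrix, sorted by variance explained.
eigvals, eigvecs = np.linalg.eigh(sim)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_factors = 1
loadings = eigvecs[:, :n_factors] * np.sqrt(np.maximum(eigvals[:n_factors], 0))
print(loadings)  # item "loadings" on the first component
```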
Substitutability assumption
Guenole et al. (2025) proposed a heuristic explanation, the substitutability assumption, as a first step toward understanding why AI psychometrics works, something we currently know very little about. The substitutability assumption suggests that an item's embedding vector can substitute for an empirical vector of human item responses under certain yet-to-be-specified conditions.
Human and machine ratings have notable parallels
Early discussion noted parallels between the processes that generate AI vector representations and those that generate empirical item responses. Person responses to scale items are human encodings of how descriptive those items are. These responses are aggregated to form a sample matrix of item response covariances or correlations for the test-taking population.
A pre-trained encoder model’s item representations are likewise encodings, and they are used to form an association matrix. If test takers have read the content that was scraped for training, that content may directly influence their ratings. If scraped content reflects cultural norms, it may indirectly reflect test takers’ knowledge and world views. Web content can therefore plausibly influence both human and machine encodings.
Here we consider further parallels between human ratings of scale statements and LLM embeddings of those same statements, both prior to and following the encoding stage, as these parallels may help explain the parameter correspondences between the two methods.
Table 1. Common stages in human ratings and LLM encodings
| Process Stage | Human Ratings | LLM Encodings |
| --- | --- | --- |
| Content authoring | Test designers write items or statements to measure constructs | Humans post text content on the Internet |
| Pre-processing | Items reviewed by experts for content and relevance | Text is cleaned and tokenized |
| Task framing | Test instructions written by developers shape test takers' perceptions | Pre-training objectives set by model designers shape model representations |
| World view source | Humans draw on self-knowledge from personal learning and cultural experience | Models acquire knowledge from shared cultural data via web scraping |
| Stimulus interpretation | Humans activate self-knowledge in response to test items and encode self-descriptiveness as ratings | Models encode items using distributional knowledge of language stored in the model weights |
| Matrix representation | Individual responses are collected and turned into a correlation matrix | Embeddings are collated and turned into a cosine similarity matrix |
| Matrix interpretation | Shared human self-knowledge is reflected in the correlational patterns | Shared corpus language usage is reflected in the cosine similarity patterns |
| Analysis | Psychometric analysis via latent variable models (e.g. factor analysis, IRT) or network models | Psychometric analysis via latent variable models (e.g. factor analysis, IRT) or network models |
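To make the final rows of Table 1 concrete, a minimal check of correspondence might correlate the unique off-diagonal entries of the two association matrices. The sketch below uses a hypothetical `responses` array of human ratings (persons by items) and a placeholder `embeddings` array; both are illustrative stand-ins, not any particular dataset or published analysis.

```python
# Sketch: comparing the human-derived and embedding-derived association matrices.
# `responses` (n_persons, n_items) and `embeddings` (n_items, embedding_dim) are
# hypothetical placeholders for real data.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 5))  # placeholder Likert-style ratings
embeddings = rng.normal(size=(5, 384))         # placeholder item embeddings

human_corr = np.corrcoef(responses, rowvar=False)  # inter-item correlation matrix
machine_sim = cosine_similarity(embeddings)        # inter-item cosine similarity matrix

# Correlate the unique off-diagonal entries as a crude index of correspondence.
iu = np.triu_indices_from(human_corr, k=1)
correspondence = np.corrcoef(human_corr[iu], machine_sim[iu])[0, 1]
print(f"Off-diagonal correspondence: {correspondence:.2f}")
```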
Much to learn but the future is bright
The substitutability assumption relies heavily on analogy rather than direct scientific evidence. It is unclear what properties of embeddings let them capture trait-related variance, what the effects of training data composition are, and whether fine-tuning can bring training data closer into line with assessment populations. The field does not yet understand these processes well, but even without knowing the precise mechanisms through which the correspondences between embedding-based and empirical psychometrics arise, the parallels offer powerful psychometric capabilities.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).