- Why choosing the right model matters
- #1 Language understanding or generation
- Decoder models
- Encoder models
- Encoders or decoders for some tasks
- #2 Determinism and reproducibility
- Token sampling strategies
- Making decoders deterministic
- #3 Open or closed models
- #4 Local or cloud-based inference
- #5 Pre-training or inference scaling models
- References
The considerations I discuss here come prior to the important process of evaluating the reliability and validity of the content, parameters, and scores that we use language models to obtain. They ask whether a model is operationally acceptable before we begin formal tests of reliability, validity, and bias in AI psychometrics. Those psychometric considerations are discussed in later sections of this book.
The criteria I discuss go beyond the common computer-science criteria of benchmarks, verifiers, leaderboards, and LLM judges. For an overview of those criteria, see Raschka (2025). See also Bulut et al. (2024) and Casabianca et al. (2025) for related considerations on LLM selection in psychometric contexts.
Why choosing the right model matters
Selecting a language model appropriate for your psychometric task is an important preliminary step in AI psychometrics because the model you choose has implications for outcomes such as privacy, security, cost, and the environmental impact of your work, as well as for psychometric criteria like faithfulness, transparency, reliability, validity, and bias.
The guidance currently available typically focuses on computer-science criteria related to accuracy; when it comes to using transformers for psychometric measurement, advice is limited. Here I discuss five criteria that should be considered before we get to analyses of model accuracy or of reliability and validity in psychometrics:
- Language understanding or language generation,
- Determinism and reproducibility,
- Open or closed models,
- Local or cloud-based inference,
- Pre-training or inference-scaling models.
#1 Language understanding or generation
Decoder models
Tasks that involve generating new text require either encoder-decoder models or, more commonly in recent commercial offerings, decoder-only models. Encoder-decoder models use bidirectional attention for encoding and causal attention plus cross-attention for generation.
Decoder-only models use causal attention, meaning they process tokens sequentially and can only attend to previous tokens when generating output. Both designs are often called Large Language Models (LLMs) and are the transformer models behind major commercial releases like GPT, Claude, Qwen, and Grok. Decoder-only models often require more training than encoder models of similar capability (more data, more passes through datasets, more compute).
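To make the attention difference concrete, here is a minimal NumPy sketch, assuming single-head attention without learned projections: with a causal mask each token can attend only to itself and earlier positions (the decoder case), while without it every token sees the whole sequence (the encoder case).

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Scaled dot-product attention over one sequence.

    q, k, v: arrays of shape (seq_len, d). With causal=True each
    position attends only to itself and earlier positions, as in
    decoder-only models; with causal=False attention is bidirectional,
    as in encoder models.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (seq_len, seq_len)
    if causal:
        # Mask out future positions before the softmax.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, 8 dimensions
print(attention(x, x, x, causal=True).shape)  # (4, 8)
```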
Encoder models
Tasks that involve representing language as numerical vectors for use in subsequent analyses require encoder models trained for language understanding. Examples include using text embeddings to predict item parameters like difficulty and using similarity matrices as input for factor analysis and network models.
Decoder embeddings can be extracted and can perform well on many benchmark tests because of their extensive pre-training. However, encoder models, sometimes called small language models (SLMs, although it is possible to have a larger encoder and a smaller decoder), are sometimes preferred for their transparency. Encoder models are trained using bidirectional attention, where tokens can attend to other tokens both before and after them in the sequence. This gives encoders full context about the entire sequence, so they perform well where this understanding is needed, as in classification tasks.
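As an illustration of the encoder workflow, the following sketch embeds a few invented items with a small open encoder and builds the kind of item-by-item similarity matrix that can feed factor-analytic or network models. The model name and item texts are illustrative assumptions, not recommendations.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical personality items for illustration only.
items = [
    "I enjoy meeting new people.",
    "I feel comfortable in large groups.",
    "I prefer to spend time alone.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open encoder
embeddings = model.encode(items)                 # shape: (n_items, dim)

# Item-by-item similarity matrix, usable as input to factor-analytic
# or network models.
sim = cosine_similarity(embeddings)
print(sim.round(2))
```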
Encoders or decoders for some tasks
Complexity arises because some tasks can be handled from either an encoder or a decoder perspective. Consider the example of predicting item difficulty. One way is to use an encoder to generate embeddings for each item and then to use those embeddings in a machine learning model that predicts item difficulties.
The parameters of that prediction model can be saved and used to predict the difficulty of items the model has not yet seen. Alternatively, LLMs can be used: in this setup, estimates of difficulty parameters are obtained by zero-shot or few-shot prompting. A critical factor in deciding between approaches is how much transparency is required.
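Here is a hedged sketch of the embedding route just described, pairing a small open encoder with ridge regression; the item texts, difficulty values, and model choice are invented for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

# Illustrative items with invented calibrated difficulties
# (e.g., IRT b-parameters from a prior calibration).
item_texts = [
    "What is 2 + 2?",
    "Solve for x: 3x - 7 = 11.",
    "Differentiate f(x) = x^2 sin(x).",
    "State the fundamental theorem of calculus.",
]
difficulties = [-1.5, -0.2, 1.1, 1.6]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(item_texts)  # embeddings as features

# Fit on calibrated items; the fitted object plays the role of the
# saved prediction model and can be applied to unseen items later.
reg = Ridge(alpha=1.0).fit(X, difficulties)
new_item = encoder.encode(["Integrate g(x) = e^x cos(x)."])
print(reg.predict(new_item))
```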
If we are focused on accurately predicting item parameters, the candidate is relatively insulated from the effects of the AI. Here we need not be as concerned about how the AI arrives at its estimates, so long as they are accurate. We might use encoder or decoder models based on whichever turns out to be most accurate, since decoders can perform encoder tasks (e.g., Weller et al., 2025).
On the other hand, in scoring situations for constructed responses, candidates are directly touched by the AI. Here transparency is paramount, and we may prefer an encoder whose embeddings feed a model that predicts human scores over a more opaque LLM. Then again, if a solid understanding of how LLMs arrive at their scores can be obtained via explainable AI (XAI) methods, LLMs may be acceptable.
#2 Determinism and reproducibility
Reproducibility is a core consideration in psychometrics and in data science more broadly. Encoders are more deterministic than decoders by their nature: once trained, they involve fixed matrix multiplications in a forward pass that always produces the same output for the same input. This makes them a more natural choice when reproducibility is critical. Decoders often involve randomness at the generation stage, after the softmax function determines token probabilities, but this can be controlled with certain token sampling strategies.
Token sampling strategies
Common token sampling strategies include greedy decoding (selecting the highest-probability token), beam search (exploring multiple high-probability sequences), top-k sampling (randomly selecting from the k most likely tokens), top-p/nucleus sampling (sampling from the smallest set of tokens whose cumulative probability exceeds p), and temperature scaling (dividing the logits by a temperature before applying softmax; temperatures below 1 sharpen the distribution towards the largest logits, while temperatures above 1 flatten it).
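The following plain-NumPy sketch shows how these strategies act on a vector of logits. Exact cut-off conventions for top-p vary slightly between implementations, so treat this as illustrative rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from raw logits.

    Illustrates temperature scaling, top-k, and top-p (nucleus)
    filtering; greedy decoding is the limiting case of taking argmax.
    """
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs < cutoff, 0.0, probs)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass
        # reaches p, always retaining the single most likely token.
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0.7, top_k=3))
print(np.argmax(logits))  # greedy decoding selects token 0
```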
Making decoders deterministic
Decoders can be made more deterministic by setting random seeds, setting temperature to zero, and using greedy decoding so that the highest-probability token is selected at each step. Even then, however, both encoders and decoders are subject to random fluctuation due to factors such as low-level hardware configurations, floating-point precision, and non-deterministic operations in deep learning frameworks.
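In the Hugging Face transformers API, for example, greedy decoding is requested by turning sampling off, which makes generation as deterministic as the underlying hardware allows. The model name below is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # seed any remaining sources of randomness

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The item measures", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,    # greedy decoding: argmax at each step
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```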
#3 Open or closed models
The open versus closed model distinction turns on a wide variety of elements. These include whether the model is community owned and governed, whether the precise data used to train the model are disclosed, whether the model's internal parameters are available for inspection, and whether there are any restrictions on the model's use. Many models claimed to be open are criticised for failing one or another of these criteria. Among the most open are the OLMo models from the Allen Institute.
From a psychometric perspective, analysts need to know that models support versioning for reproducibility and won't be subject to arbitrary commercial decisions (e.g., access being removed in an update). They need to know what data were used to train the model so that the correspondence between training and application can be assessed. Access to the model weights allows methods such as integrated gradients to be used to explore AI scores. And restrictions on model use, for instance any commercial restrictions, must be known before deployment.
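With open models, version pinning is straightforward. For example, models hosted on the Hugging Face hub can be loaded at a fixed revision so that later runs retrieve identical weights; the model name and revision below are placeholders.

```python
from transformers import AutoModel

# Pin the exact model revision (a commit hash or tag on the model hub)
# so that later analyses load the same weights.
model = AutoModel.from_pretrained(
    "bert-base-uncased",  # illustrative model
    revision="main",      # replace with a specific commit hash
)
```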
#4 Local or cloud-based inference
It might at first seem that the local versus cloud-based distinction overlaps with the open versus closed distinction, and in some cases it does, but it is not the case that open models always run locally and closed models always run in the cloud. Several open models, like Meta's Llama and Mistral, are also run in the cloud.
In cases where psychometric professionals want to maximise security, they may choose to run models locally. This may be valuable in high-stakes credentialing programmes where security or even psychometric governance constraints mean items cannot be shared with external organisations, such as model vendors, during development or deployment.
Locally run models can also have cost advantages, as they do not require API calls, but that advantage must be weighed against increased hardware and electricity costs. With respect to the electricity costs of online providers, it may be possible to evaluate them on the proportion of the energy they consume that is recycled, a point noted by Bulut et al. (2024).
There do not currently appear to be closed models that run locally without phoning home, in the way that other statistical psychometric software, like Mplus, does. The most secure way for vendors to protect their models is to keep them on their own secure servers. Conversely, local deployment may be required to prevent test-taker data from being processed by third-party model providers under legislation such as the GDPR.
#5 Pre-training or inference scaling models
The last key consideration I highlight when choosing a language model in psychometrics is whether to use models that generate responses directly from pre-training and RLHF, models that add extended inference-time reasoning, like Claude, OpenAI's o1, or Grok 4 (see Raschka, 2025), and whether to augment either approach with retrieval-augmented generation (RAG) for the latest available information. For straightforward content generation, like producing multiple-choice items, models like the Llama series will often suffice.
For tasks requiring complex reasoning, for instance judging whether a novel item truly measures the intended psychological construct, models with extended reasoning from Anthropic or OpenAI become valuable. Ideally the reasoning will be visible. Transparency matters because psychometricians need to audit the model's logic to ensure it hasn't introduced construct-irrelevant variance. RAG capabilities become relevant when generating items requiring current knowledge, such as science questions about recent discoveries or social studies items referencing contemporary events that post-date the model's training.
References
Bulut, O., Beiting-Parrish, M., Casabianca, J. M., Slater, S. C., Jiao, H., Song, D., Ormerod, C. M., Fabiyi, D. G., Ivan, R., Walsh, C., Rios, O., Wilson, J., Yildirim-Erbasli, S. N., Wongvorachan, T., Liu, J. X., Tan, B., & Morilova, P. (2024). The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. arXiv preprint arXiv:2406.18900.
Casabianca, J. M., McCaffrey, D. F., Johnson, M. S., Alper, N., & Zubenko, V. (2025). Validity arguments for constructed response scoring using generative artificial intelligence applications. arXiv preprint arXiv:2501.02334.
Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., ... & Larochelle, H. (2021). Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164), 1-20.
Pisoni, G. (2025, November 3). Toward community-governed safety. Hugging Face Blog. https://huggingface.co/blog/giadap/community-safety
Raschka, S. (2025, October 5). Understanding the 4 main approaches to LLM evaluation (From Scratch): Multiple-choice benchmarks, verifiers, leaderboards, and LLM judges with code examples. Ahead of AI magazine. https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches?
Raschka, S. (2025, October 5). Build a Reasoning Model (From Scratch). https://sebastianraschka.com/books
Weller, O., Ricci, K., Marone, M., Chaffin, A., Lawrie, D., & Van Durme, B. (2025). Seq vs seq: An open suite of paired encoders and decoders. arXiv preprint arXiv:2507.11412.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).