- AI psychometric agent overview
- Tool access: semantic alignment
- Memory: storing item generation failures
- More tooling: factor analysis and iteration
- ReAct versus DAG
- ReAct planning within a DAG
- Single agent versus multi-agent systems
- So where are we now?
- References
An artificial intelligence (AI) agent takes a Large Language Model (LLM) and adds three core components: tool access, memory, and planning. But what exactly does this mean in practice?
While agents might be used to accomplish many psychometric tasks, let’s consider a simple agentic psychometric workflow to design new psychometric scales. We will sketch out the different tasks the psychometric agent would need to complete and design choices we face.
This section will conclude that AI agents have high potential utility in psychometrics but, for now at least, their value is unproven. The best agent is still an experienced human with access to the various AI components and who can exercise expert judgment in scale design.
AI psychometric agent overview
The first goal of our psychometric agent will be to generate items against user-specified construct definitions. We recognize that the quality of the definitions provided places an upstream constraint on the success of the entire agentic workflow.
Guardrails regarding item structure and related style requirements can be incorporated into the prompt at this stage. These specifications could include requirements for readability and cultural sensitivity, for example. Extensions such as RAIG-based item generation can also be deployed.
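As a minimal sketch of how guardrails might be embedded in a generation prompt, the helper below assembles the instructions the agent would send to the LLM. The function name, the specific style rules, and the construct definition are all illustrative assumptions, not a prescribed template.

```python
# Sketch of assembling an item-generation prompt with guardrails baked in.
# The guardrail list and construct definition below are illustrative only.

def build_item_prompt(construct: str, definition: str, n_items: int = 10) -> str:
    """Build a generation prompt that embeds style guardrails in the instructions."""
    guardrails = [
        "Write each item as a single first-person statement.",
        "Target a Grade 8 reading level or below.",
        "Avoid idioms and culturally specific references.",
        "Avoid double-barrelled items (one idea per item).",
    ]
    rules = "\n".join(f"- {rule}" for rule in guardrails)
    return (
        f"Generate {n_items} survey items measuring '{construct}'.\n"
        f"Construct definition: {definition}\n"
        f"Follow these style requirements:\n{rules}"
    )

prompt = build_item_prompt(
    "Conscientiousness",
    "The tendency to be organized, responsible, and hard-working.",
)
```

The prompt string would then be passed to whichever LLM the agent wraps; keeping the guardrails in one place makes them easy to audit and revise.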

Tool access: semantic alignment
Next, the agent will check the semantic alignment indices to ensure item-construct alignment. This introduces the first feature that differentiates an LLM from an agent: access to tools that are not natively part of the LLM. We will use standard machine learning packages for the tooling.
The tools we will specify in our agent design are Sentence Transformers for the embeddings and scikit-learn for computing cosine similarity. The agent will check semantic item alignment, that is, the cosine similarity between generated items and the user-defined constructs.
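The alignment check can be sketched as follows. To keep the example self-contained, toy NumPy vectors stand in for real sentence embeddings (in the agent these would come from a Sentence Transformers model via `model.encode(...)`, and scikit-learn's `cosine_similarity` would give the same result); the threshold values are illustrative assumptions.

```python
import numpy as np

def alignment_scores(item_vecs: np.ndarray, construct_vec: np.ndarray) -> np.ndarray:
    """Cosine similarity between each item embedding and the construct embedding.
    In the real agent the embeddings would come from a Sentence Transformers
    model; scikit-learn's cosine_similarity computes the same quantity."""
    item_norms = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    c_norm = construct_vec / np.linalg.norm(construct_vec)
    return item_norms @ c_norm

def passes_alignment(score: float, lo: float = 0.4, hi: float = 0.95) -> bool:
    """Flag items that are off-construct (too low) or near-duplicates of the
    construct definition (too high). Thresholds here are illustrative."""
    return lo <= score <= hi

# Toy 3-dimensional embeddings standing in for real sentence embeddings.
items = np.array([[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 1.0, 0.0]])
construct = np.array([1.0, 0.0, 0.0])
scores = alignment_scores(items, construct)  # first item aligns best
```

Items whose scores fall outside the acceptance band would be routed to the memory step described next.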
Memory: storing item generation failures
To demonstrate the second hallmark of agents, memory, the psychometric agent will store items that fail the alignment check (based on acceptable lower and upper thresholds), along with the reasons for the failure, to guide future item generation.
The options for agent memory at this stage are software-level choices. For example, we could append what earlier agent iterations learned to the beginning of the prompt, store the lessons as a list, or make them a persistently accessible data source.
The prompt-based approach is the simplest of these options but is expected to degrade over long loops. Persistent retrieval is expected to scale better but can introduce its own noise during lookup.
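A minimal sketch of the list-based option is below; the class and method names are assumptions for illustration. A persistent store (a database or vector index) could replace the in-memory list without changing the interface.

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """One rejected item, its alignment score, and the diagnosed reason."""
    item: str
    score: float
    reason: str

class ItemMemory:
    """Minimal list-based memory; a persistent store could replace the list."""

    def __init__(self) -> None:
        self.failures = []

    def record(self, item: str, score: float, reason: str) -> None:
        self.failures.append(FailureRecord(item, score, reason))

    def as_prompt_context(self, max_entries: int = 5) -> str:
        """Render recent failures for prepending to the next generation prompt."""
        recent = self.failures[-max_entries:]
        lines = [f"- '{r.item}' (similarity {r.score:.2f}): {r.reason}" for r in recent]
        return "Avoid mistakes like these previously rejected items:\n" + "\n".join(lines)

memory = ItemMemory()
memory.record("I am a good person.", 0.21, "too generic; off-construct")
```

Capping `max_entries` is one simple way to limit the prompt-degradation problem noted above.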
More tooling: factor analysis and iteration
Finally, the agent must use further tooling to examine the factor structure using pseudo factor analysis (Guenole et al., 2025). If the factor structure is unsatisfactory based on residual checks (see Suárez-Álvarez et al., 2026), the agent can return to the beginning of the workflow and generate new items.
This step is likely to be the most fragile node in the workflow and will require a hard iteration cap. As I emphasize everywhere I discuss pseudo factor analysis, construct validity is not only semantic, and regular test standards must ultimately be met.
There might be a real gap between semantic coherence and construct validity. There are many possible extensions to this workflow. For example, it could be adapted to include artificial crowd responses to the new items for analysis.
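The regenerate-with-a-cap logic can be sketched as a short loop. Here `generate_items` and `pseudo_factor_check` are placeholders for the LLM call and for a factor analysis of the item-embedding similarity matrix; the cap of five and the scripted stubs are assumptions for demonstration.

```python
# Sketch of the iteration loop with a hard cap. `generate_items` and
# `pseudo_factor_check` are placeholders for the LLM call and for a
# pseudo factor analysis of the item-embedding similarity matrix.

MAX_ITERATIONS = 5

def run_workflow(generate_items, pseudo_factor_check):
    """Regenerate items until the factor check passes or the cap is hit."""
    for attempt in range(1, MAX_ITERATIONS + 1):
        items = generate_items(attempt)
        ok, diagnostics = pseudo_factor_check(items)
        if ok:
            return items, attempt
    return None, MAX_ITERATIONS  # cap reached; hand off to a human reviewer

# Stubs for demonstration: the check is scripted to pass on the third attempt.
result, attempts = run_workflow(
    generate_items=lambda attempt: [f"item {i} (draft {attempt})" for i in range(10)],
    pseudo_factor_check=lambda items: ("draft 3" in items[0], {}),
)
```

Returning `None` at the cap, rather than looping indefinitely, is what makes this fragile node safe to automate.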
ReAct versus DAG
With these goals and stages as a backdrop, there are two agentic approaches we might follow. The first is the Reason and Act (ReAct) framework (Yao et al., 2023), which is better suited when the path to the goal is less clear and structured.
The second is a Directed Acyclic Graph (DAG) based method that leads the agent through a well-understood path. Here we'll use the DAG approach because we have specified a clear sequence of events but, as we will see, we can still incorporate ReAct functionality.
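The workflow above can be written down as a tiny DAG: each node lists its successors, and a topological sort yields a valid execution order. The node names are the stages described earlier; the graph happens to be a simple chain here, but the same structure admits branches.

```python
# A minimal sketch of the fixed workflow as a DAG. Each node names its
# successors; this particular graph is a chain, but branches are allowed.
WORKFLOW_DAG = {
    "generate_items": ["check_alignment"],
    "check_alignment": ["update_memory"],
    "update_memory": ["pseudo_factor_analysis"],
    "pseudo_factor_analysis": [],
}

def topological_order(graph):
    """Return an execution order respecting the edges (Kahn's algorithm)."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    order = []
    ready = [node for node, d in indegree.items() if d == 0]
    while ready:
        node = ready.pop()
        order.append(node)
        for s in graph[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order
```

Orchestration frameworks provide richer versions of exactly this structure; the point here is only that the execution order is fixed in advance, unlike in pure ReAct.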
ReAct planning within a DAG
What usually distinguishes an AI agent from a tool-augmented pipeline is a planning capability that chooses when to use tools, when to update memory, and what to do next. To incorporate planning, a ReAct-style inner loop can be added within each node, letting the LLM reason before committing to the next step.
The LLM planning (or reasoning) can check whether items are diverse enough, whether borderline alignment reflects poor items, whether failure is due to the construct definition, or whether a "poor" factor structure requires regeneration or simply reflects acceptable multidimensionality. In the case of a poor factor structure, the planning loop can regenerate items from the start.
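A ReAct inner loop for one node can be sketched as a reason, act, observe cycle. The `llm_reason` callable stands in for the actual LLM call, and the scripted stand-in plus the `diversity_check` tool are assumptions for demonstration; a real agent would pass the trace and observation back to the model each step.

```python
# Sketch of a ReAct-style inner loop for one DAG node. `llm_reason` is a
# placeholder for an LLM call that returns a thought, an action, and an argument.

def react_node(observation, llm_reason, tools, max_steps=3):
    """Reason/act loop: the LLM picks a tool (or 'finish') at each step."""
    trace = []
    for _ in range(max_steps):
        thought, action, arg = llm_reason(observation, trace)
        trace.append((thought, action))
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)  # act, then observe the result
    return observation, trace  # step cap reached without a decision

# Demonstration with a scripted "LLM" that checks diversity, then finishes.
def scripted_llm(observation, trace):
    if not trace:
        return ("Check item diversity first.", "diversity_check", observation)
    return ("Diversity is acceptable; proceed.", "finish", "proceed")

tools = {"diversity_check": lambda items: f"{len(set(items))} unique items"}
decision, trace = react_node(["item a", "item b"], scripted_llm, tools)
```

The step cap plays the same role here as the iteration cap in the factor-analysis node: it keeps the inner reasoning loop from running away.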
Single agent versus multi-agent systems
A further consideration is whether the workflow needs only a single agent or a multi-agent system with specialist roles and handoffs between agents for different tasks. A single agent maintains shared context over all steps but its performance may degrade over long loops.
Multi-agent orchestration of different agents for different sub-tasks keeps each node focused, but the agents must be coordinated and there may be information loss at handoff points between agents. Orchestration frameworks such as LangChain, LlamaIndex, and LangGraph can manage the agent coordination.
So where are we now?
For all the talk of AI agents, I have not yet seen a start-to-finish agentic workflow that results in a robust semantic psychometric structure ready for SME review or human pre-trialling. Partial implementations of individual steps exist, as do pipelines without agents (e.g., Lee et al., 2025; Russell-Lasalandra et al., 2025), but there do not appear to be fully autonomous implementations.
Right now I'm using these methods separately to speed up the scale design process in consulting engagements, but I have not automated it entirely, and the conditions for success we have discussed (i.e., a clean semantic factor structure) are stringent.
Still, if any approach can find a clean semantic pseudo factor structure in a haystack, AI probably can. There is some reason for optimism, even if the evidence has not emerged yet. Once again, the real test is always the reliability and validity of actual human responses to AI-designed scales.
References
Guenole, N., D’Urso, D. E., Samo, A., Sun, T., & Haslbeck, J. M. B. (2025). Enhancing scale development: Pseudo factor analysis of language embedding similarity matrices. PsyArXiv. https://osf.io/preprints/psyarxiv/vf3se
Lee, P., Son, M., & Jia, Z. (2025). AI-powered automatic item generation for psychological tests: A conceptual framework for an LLM-based multi-agent AIG system. Journal of Business and Psychology, 41(1), 71–99. https://doi.org/10.1007/s10869-025-10067-y
Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. https://arxiv.org/abs/2304.03442
Russell, S., & Norvig, P. (2021). Artificial intelligence: A modern approach (4th ed.). Pearson.
Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2025). Generative psychometrics via AI-GENIE: Automatic item generation with network-integrated evaluation. PsyArXiv Preprints. Advance online publication. https://doi.org/10.31234/osf.io/fgbj4_v2
Suárez-Álvarez, J., He, Q., Guenole, N., & D’Urso, D. (2026). Using artificial intelligence in test construction: A practical guide. Psicothema.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. International Conference on Learning Representations. https://arxiv.org/abs/2210.03629
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).