In this section we discuss and demonstrate Artificial Intelligence (AI) enhanced methods used in the applied practice of industrial-organizational (IO) psychology for non-cognitive scale design. These are insights from applied practice; studies on the efficacy of LLMs for item generation are only beginning to emerge. Early results from Hernandez and Nie (2023) and Lee et al. (2023) showed promising psychometric results for LLM-generated items.
In this conceptual section and the technical sections on item generation that follow, we use the running case example of a measure of executives’ moral foundations. We describe and then apply these methods to item generation for the Executive Moral Foundations questionnaire, showing the step-by-step procedure for item development via API calls under various item generation strategies.
- Decision 1. Web user interface or API?
- Intellectual property
- Bias in generated items
- Data security
- Decision 2. Direct versus guided item generation
- Decision 3. Zero shot versus few shot item generation
- Item generation strategies
- Decision 4. Extensions including RAG and fine-tuning
- Quality checks
Decision 1. Web user interface or API?
The first decision is whether to use a web user interface (UI) such as Claude or ChatGPT, or to make calls to an API via a notebook like Jupyter, an integrated development environment (IDE) such as Visual Studio Code, or a cloud-based environment like Google Colab. The scale of the project here is easily manageable on local machines. While a web UI is adequate for fast prototyping on small projects, API calls offer greater security and control.
API calls minimise the prompting needed (e.g., when generating large numbers of items where output exceeds the model’s token limits), ensure consistent parameters across sessions, which the web UI does not guarantee, and give access to more parameters that are likely to affect quality, such as temperature (controls randomness in text generation), top_p (limits generation to the top-probability tokens for coherence), and max_tokens (a hard cap on output length). The API route requires some coding; however, it is manageable for anyone familiar with coding basic analyses in standard statistics software.
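A minimal sketch of such an API call is shown below, assuming the OpenAI Python SDK and an API key stored in the environment; the model name, prompt wording, and parameter values are illustrative only, and other providers offer analogous clients.

```python
# Minimal sketch of an API call for item generation.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable;
# the model name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are an expert psychometric item writer."},
        {"role": "user", "content": "Generate 10 Likert-type items measuring executives' moral foundations."},
    ],
    temperature=0.7,  # randomness of generation
    top_p=0.9,        # nucleus sampling: restrict to top-probability tokens
    max_tokens=500,   # hard cap on output length
)

print(response.choices[0].message.content)
```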
Intellectual property
It's important to think about intellectual property (IP) implications when using generative AI for writing items. Content from commercial AI models could be subject to licensing restrictions, or ownership of items may not automatically be assigned to the user. Reusing items from publicly available instruments, even as training data, may carry risks related to originality and copyright. Legal review of IP rights and model terms of service is recommended before operational use.
Bias in generated items
AI-generated items may reflect biases present in the models or the data they were trained on. This creates potential fairness risks with regard to protected characteristics (e.g., gender, ethnicity, age). It is important to have subject matter experts review generated items for biased content and to conduct fairness analyses (e.g., measurement invariance, adverse impact) during empirical validation to ensure compliance with ethical and legal standards.
Data security
If using cloud-based models and APIs, user prompts and generated content are likely to be stored or processed on external systems. This can raise data privacy and security concerns, particularly for sensitive assessment content where item security is paramount. It is important to review your AI provider’s data handling policies and to choose approaches that ensure confidential and secure handling of input prompts and all generated materials.
Decision 2. Direct versus guided item generation
To generate items via a notebook like Jupyter or an integrated development environment (IDE) like Visual Studio Code, prompt engineering methods are commonly used. Methods include direct item generation, where we instruct the LLM to generate items measuring the focal construct without constraints, and guided item generation, where we provide detailed instructions about item requirements, such as construct definitions, item templates, and other constraints that might be necessary, such as item polarity.
In other words, direct item generation is a barebones prompt approach while guided item generation is a highly engineered prompting approach that specifies item characteristics, which can be with or without examples. Guided item generation is generally preferable. Without detailed instruction, results are likely to be poor, with weak content validity, inconsistent item quality, and the risk of construct drift (items that deviate from measuring their intended constructs).
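To make the contrast concrete, the sketch below places an illustrative barebones prompt next to a guided prompt; the construct definition and constraints are placeholders rather than recommended wording.

```python
# Illustrative contrast between a direct (barebones) and a guided prompt.
# The construct definition and constraints below are placeholders.

direct_prompt = "Generate 10 items measuring executives' moral foundations."

guided_prompt = """
You are writing items for the Executive Moral Foundations questionnaire.
Construct definition: the degree to which an executive's decision making is
guided by concerns of care, fairness, loyalty, authority, and sanctity.
Requirements:
- Each item is a single declarative statement answerable on a 5-point agreement scale.
- No more than 10 words per item.
- Write at approximately a grade 8 reading level.
- Generate 6 positively phrased and 4 negatively phrased items.
Return the items as a numbered list.
"""
```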
Decision 3. Zero shot versus few shot item generation
Under each approach, we can give example items to the LLM or omit them. These are simply included in the largely natural-language prompt with minimal technical scripting. If no example items are available or given, the approach is considered zero-shot prompting.
Zero-shot prompting gives very little control over the nature of the items that are created and requires stronger human oversight than when examples are given. It may nonetheless be preferred if item variety is a key focus, such as in the early stages of scale design.
If we do give examples of the items to be generated, we refer to the method as few-shot prompting. Few-shot item generation, particularly when a guided approach is used, generally produces useable results. The examples in few-shot prompting ground the model in the context of the task at hand.
By giving the model a few examples, you show it the linguistic features you expect to see in strong candidate items. While few-shot prompting gives less diversity than zero-shot prompting, it tends to give more diversity than fine-tuned models, particularly when the fine-tuning is on narrow training data. With few-shot prompting there is also no need to train or maintain specialized models.
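The sketch below shows one way example items might be embedded in a few-shot prompt; the example items are invented for illustration only.

```python
# Sketch of a few-shot prompt: example items are embedded in the prompt to
# ground the model. The example items below are invented for illustration.

example_items = [
    "I consider the wellbeing of employees in every major decision.",
    "Fair treatment of all stakeholders guides my strategic choices.",
    "I remain loyal to my team even under competitive pressure.",
]

few_shot_prompt = (
    "Here are example items measuring executives' moral foundations:\n"
    + "\n".join(f"- {item}" for item in example_items)
    + "\n\nWrite 10 new items in the same style, each under 10 words, "
      "measuring the same construct without repeating the examples."
)
```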
Item generation strategies
Combining the possibilities relating to the level of guidance provided and whether or not examples are given leads to four possibilities that we refer to as label-based item generation, correspondence-based item generation, constraint-driven item generation, and blueprint-based item generation. These possibilities vary in the level of control the user imposes on the AI model during the item generation process: higher control leads to more item consistency and less model creativity, while lower control leads to less item consistency and more model creativity. In more exploratory stages of construct development, label-based item generation may be preferred, while in more formal item generation settings, the greater control of blueprint-based item generation may be preferred.
|  | Barebones prompt | Detailed prompt |
| --- | --- | --- |
| Zero shot | Label-based item generation | Constraint-driven item generation |
| Few shot | Correspondence-based item generation | Blueprint-based item generation |
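One way to see how the two decisions combine into these four strategies is sketched below; the helper function and prompt fragments are schematic placeholders, not recommended wording.

```python
# Schematic view of the four strategies as combinations of two choices:
# guidance level (barebones vs. detailed) crossed with examples (none vs. a few).

def build_prompt(detailed, examples=None):
    """Compose a prompt from a guidance level and optional example items."""
    base = "Generate 10 items measuring executives' moral foundations."
    if detailed:
        base += ("\nConstraints: single declarative statements, no more than 10 words, "
                 "5-point agreement scale, grade 8 reading level.")
    if examples:
        base += "\nExample items:\n" + "\n".join(f"- {e}" for e in examples)
    return base

examples = ["I consider employee wellbeing in every major decision."]

label_based          = build_prompt(detailed=False)                     # zero shot, barebones
constraint_driven    = build_prompt(detailed=True)                      # zero shot, detailed
correspondence_based = build_prompt(detailed=False, examples=examples)  # few shot, barebones
blueprint_based      = build_prompt(detailed=True, examples=examples)   # few shot, detailed
```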
Decision 4. Extensions including RAG and fine-tuning
We note that the LLM instructions can be further tailored with examples retrieved from searches of databases or existing items, an approach referred to as Retrieval Augmented Generation (RAG). RAG approaches are similar to few-shot prompting in that both provide examples, but with RAG the examples result from a search for instruction-relevant information rather than being hard-coded or directly provided by the user, as they are in the guided item generation approach.
It is also possible to fine-tune encoders and encoder-decoders to generate better items if high-quality examples are available. Fine-tuned item generation involves adjusting a model’s internal weights by training it to predict words in example items using prompt-item pairs. This can be a time- and computationally intensive process, and there is a significant risk of overfitting, where the model becomes too tailored to the training data set. We address this topic after we have items with clean item-construct alignment and a clean factor structure, as these are needed inputs for the fine-tuning process.
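A minimal RAG-style sketch is shown below, assuming the OpenAI embeddings endpoint; the item bank, query wording, and number of retrieved items are hypothetical and stand in for a search over a real database of existing items.

```python
# Minimal RAG-style sketch: retrieve the existing items most similar to a
# construct description and splice them into the generation prompt.
# Assumes the OpenAI embeddings endpoint; `item_bank` is a hypothetical list.
import numpy as np
from openai import OpenAI

client = OpenAI()

item_bank = [
    "I weigh fairness to all parties before acting.",
    "Loyalty to my organization shapes my decisions.",
    "I protect vulnerable stakeholders from harm.",
]
query = "Executives' moral foundations: care, fairness, loyalty, authority, sanctity."

def embed(texts):
    """Return an array of embedding vectors for a list of texts."""
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in result.data])

bank_vecs = embed(item_bank)
query_vec = embed([query])[0]

# Cosine similarity between the query and each bank item, then keep the top 2.
sims = bank_vecs @ query_vec / (np.linalg.norm(bank_vecs, axis=1) * np.linalg.norm(query_vec))
top_items = [item_bank[i] for i in np.argsort(sims)[::-1][:2]]

rag_prompt = (
    "Retrieved example items:\n"
    + "\n".join(f"- {t}" for t in top_items)
    + "\n\nWrite 10 new items measuring the same construct."
)
```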
Quality checks
We will give instructions regarding item features such as item length (e.g., ‘items must be no more than 10 words’), reading difficulty (e.g., ‘only generate items at a grade x reading level’), and polarity or trait level/location requirements (e.g., ‘generate x positively phrased and y negatively phrased items’). However, items will not always match your criteria exactly.
It is important to check whether items meet the criteria specified in the prompts and examples used to steer generation. This can occur as a series of constraints during item generation itself, or items can be checked with a prompting approach post generation. If the number of items is small (e.g., in the hundreds), it is feasible to check these manually.
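A simple post-generation check of rule-checkable criteria might look like the sketch below; the thresholds mirror the example instructions above, and the third-party textstat package is assumed for estimating reading level.

```python
# Post-generation quality check sketch: flag items that violate simple,
# rule-checkable constraints (length and reading level in this example).
# Assumes the third-party `textstat` package for reading-level estimates.
import textstat

generated_items = [
    "I consider employee wellbeing in every major decision.",
    "I rarely think about whether my choices treat people fairly.",
]

MAX_WORDS = 10  # mirrors the 'no more than 10 words' instruction
MAX_GRADE = 8   # mirrors the illustrative grade-level instruction

for item in generated_items:
    n_words = len(item.split())
    grade = textstat.flesch_kincaid_grade(item)
    flags = []
    if n_words > MAX_WORDS:
        flags.append(f"too long ({n_words} words)")
    if grade > MAX_GRADE:
        flags.append(f"reading level too high (grade {grade:.1f})")
    print(item, "->", flags or "OK")
```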
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).