- Pseudo-discrimination
- A better name may be ‘semantic alignment’
- Human item assignments are another sensible criterion
- Calculating semantic alignment indices
- Step 1. Choose a sentence transformer
- Step 2. Encode items and constructs
- Step 3. Calculate cosine similarities
- Step 4. Subsetting items for pseudo factor analysis
Pseudo-discrimination
It is possible to use large language models to examine item validity. We have proposed that the cosine similarity between the embedding of an item and the embedding of its parent construct definition can serve as an indicator of item-construct alignment, which we initially called pseudo-discrimination. We showed that these cosine similarities were significant predictors of factor loadings.
A better name may be ‘semantic alignment’
We described these values as pseudo-discrimination because they can approximate empirical discrimination. We have received feedback questioning the reference to item-construct cosine similarities as pseudo-discrimination parameters, because discrimination concerns differentiation between individuals at different levels of a trait, not shared language.
Given that discrimination is an item-to-construct relation, and the construct is defined by all of its items, items must share meaning with each other in order to discriminate. If they do not, people will not respond similarly across related items and there will be no item-total association. Moreover, for valid discrimination, items should also share meaning with the construct definition. In other words, discrimination is about statistical differentiation, but what is being differentiated also matters.
Nonetheless, we are suggesting only that semantic alignment between items and constructs is a prerequisite for valid discrimination: a necessary but not sufficient condition. It is true, then, that item-construct alignment is not the same as discrimination. To avoid confusion, it may be better to refer to these values as 'semantic alignment indices' rather than pseudo-discrimination parameters.
Human item assignments are another sensible criterion
Another comment received concerned the small-to-moderate correlations with empirical discrimination parameters reported in early work. Perhaps a better choice of validation criterion is human item assignments. Given we are suggesting that shared semantic interpretation is a prerequisite for valid empirical discrimination, we still recommend including empirical discrimination in the nomological network of this metric. However, we agree that checking the correspondence between human-determined category memberships and the semantic alignment values for each construct is a sensible validation step.
Calculating semantic alignment indices
Step 1. Choose a sentence transformer
The type of transformer we used is a specialized model called a sentence transformer. A sentence transformer is a pre-trained model that encodes scale statements as numeric vectors; its underlying language model is trained with a masked prediction approach, predicting each word in an item from all the other words in the sentence. There are many sentence transformers available, and many of them are hosted at huggingface.co, a hub for AI models for text and other modalities (e.g., audio, vision).
When choosing a sentence transformer, researchers will often evaluate the performance of different models based on publicly available leaderboards of results from benchmarking tests. It is also common to consider the overlap between the training data for the model and the text that you wish to use it on, to ensure it is sufficiently representative. There is much discussion of biases that can emerge when the training data does not match the data you wish to encode (i.e., in both text type and the characteristics of the populations that generated the text).
Another consideration is the carbon footprint of the models, including, for instance, the renewable energy commitments of the model providers. At this point, there is often no strong rationale for choosing one model over another for a specific task. Instead, people try multiple models in a check-it-and-see approach to find which works best for the task at hand. In our early work we have also reported on the use of ensembles of sentence encoders. Here we will use the MiniLM sentence encoder, loaded as sketched below.
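As a minimal sketch, the snippet below loads a MiniLM model via the sentence-transformers Python library. The specific checkpoint name, all-MiniLM-L6-v2, is our assumption (it is the most widely used MiniLM sentence encoder on Hugging Face); substitute whichever variant you prefer.

```python
# Minimal sketch: load a MiniLM sentence encoder with the
# sentence-transformers library. The "all-MiniLM-L6-v2" checkpoint is an
# assumption; any sentence transformer hosted on Hugging Face will work.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
```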
Step 2. Encode items and constructs
There is little or no pre-processing required for encoding items, other than entering the items and construct definitions into a text or .csv file for import into the Jupyter notebook. The most important question is whether the construct embedding should be based on the construct name, the construct definition, the concatenated string of all items, or some combination of the three.
In practice, we recommend concatenating the construct label with the construct definition to form the construct embedding that is paired with each item embedding to check semantic alignment. We do not recommend using the items themselves (excluding the studied item) as part of the construct embedding, because doing so mixes internal and external construct-related criteria.
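As a minimal sketch of this step, and assuming hypothetical file and column names (items.csv with item_text and construct columns; constructs.csv with construct and definition columns), the encoding might look like this:

```python
import pandas as pd

# Hypothetical file layout: adjust names to match your own files.
items = pd.read_csv("items.csv")            # columns: item_text, construct
constructs = pd.read_csv("constructs.csv")  # columns: construct, definition

# Concatenate each construct label with its definition, as recommended above.
construct_texts = (constructs["construct"] + ": " + constructs["definition"]).tolist()

# Encode with the sentence transformer loaded in Step 1.
item_embeddings = model.encode(items["item_text"].tolist())
construct_embeddings = model.encode(construct_texts)
```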
Step 3. Calculate cosine similarities
Cosine similarity is often preferred as a measure of association in natural language processing because it assesses association via the angle between two vectors rather than their lengths.
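To make the point concrete, the toy example below shows that two vectors pointing in the same direction have a cosine similarity of 1 regardless of their lengths:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # The dot product of the vectors divided by the product of their
    # lengths, i.e., the cosine of the angle between them.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same direction as a, but twice the length
print(cosine(a, b))       # 1.0: vector length plays no role
```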
Cosine similarities are also preferred because they are stable in very high dimensions, a characteristic of many modern embedding methods, particularly those based on transformers. The code for generating the cosine similarity between every item and every construct is provided below.
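A minimal sketch, continuing from the embeddings computed in Step 2 (util.cos_sim is a helper in the sentence-transformers library):

```python
from sentence_transformers import util

# Cosine similarity between every item and every construct:
# rows index items, columns index constructs.
similarities = util.cos_sim(item_embeddings, construct_embeddings)
print(similarities.shape)  # (n_items, n_constructs)
```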
Step 4. Subsetting items for pseudo factor analysis
Based on the semantic alignment indices, we wish to choose subsets of the generated items that are maximally similar to their target construct definitions and minimally similar to any other construct definition for further investigation. These criteria help ensure that the items are indeed related to their construct definitions, a check not available in classical test theory, factor analysis, or IRT.
To select items that meet these criteria, we calculate two values for each item: the item's cosine similarity with its target construct minus the maximum of its similarities with all other construct definitions, and the item's cosine similarity with its target construct minus the mean of its similarities with all other construct definitions. From this information we can select well-performing items to take to the next stage, which is checking dimensionality.
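A minimal sketch of these two calculations, continuing from the similarity matrix above; the construct column and the zero cut-off are illustrative assumptions:

```python
import numpy as np

sims = similarities.numpy()  # shape: (n_items, n_constructs)

# Column index of each item's target construct (uses the hypothetical
# "construct" column from Step 2 naming each item's parent construct).
col_of = {name: i for i, name in enumerate(constructs["construct"])}
target_idx = items["construct"].map(col_of).to_numpy()

rows = np.arange(sims.shape[0])
target_sim = sims[rows, target_idx]

# Mask the target construct so the max/mean cover only the other constructs.
others = sims.copy()
others[rows, target_idx] = np.nan

items["margin_over_max"] = target_sim - np.nanmax(others, axis=1)
items["margin_over_mean"] = target_sim - np.nanmean(others, axis=1)

# Illustrative selection rule: keep items that are more similar to their
# target construct than to any other construct.
selected = items[items["margin_over_max"] > 0]
```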
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).