Negative control: If you can’t simulate, destroy!

Psychometrics based on learned representations is a new subfield, and ongoing research is needed on the conditions under which the techniques work well and those under which they do not. Examinations so far test embedding structures against a corresponding human response model rather than a generative model, or against face valid scale memberships (i.e., do the methods recover what we told LLMs to generate?) rather than against a true population model. Early demonstrations using these strategies reveal strong correspondence between semantic and human response parameters and show that the methods can recover face valid structures.

Parameterization mismatch

Ideally, however, we would like to use a Monte Carlo approach in which data are simulated from a known generative model and the method is judged on how well it recovers that model. In embedding psychometrics a generative psychometric factor model does not yet exist (nor does a true network model, or any other conventional model used in psychometrics) because of a parameterization mismatch: the generative model is autoregressive. Embedding-based Monte Carlo simulation in psychometrics needs a generative measurement model, but embeddings do not have one that is parameterized in psychometric terms, such as factor analysis, item response theory, or a network-based approach.

Negative control robustness designs

We need a way to check that the recovered model is not a method artefact. While we cannot easily run a true Monte Carlo simulation, we can implement a negative control robustness check. We start with real embeddings that have an expected rational or theoretical structure, then progressively destroy the signal. If the structure we observe reflects genuine semantic structure, destroying that signal should break factor recovery. Importantly, however, in doing so we have switched from our ideal of comparing a recovered model to a generative model to comparing the observed model to the intended or theoretical model. The consequences should not be overlooked, and we return to them later.

Small experiment

First, however, let’s try an experiment. The code to reproduce these results is available on GitHub. We use the item text for the 50 IPIP big five marker items, the big five as our theoretical model, and one encoder, a small language model. Items were first embedded with MiniLM. Cosine similarity between all item pairs produces a 50×50 matrix. Pseudo factor analysis on that matrix recovers all five personality factors cleanly. Next we replace the embeddings of all items on one scale with random Gaussian noise, factor analyze the resulting similarity matrix, and check recovery. Then we repeat, replacing the items of one more scale with noise at each step. We factor analyze the matrix at each step using a method such as maximum likelihood with oblique rotation.
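The noise-injection loop described above can be sketched as follows. This is a minimal illustration, not the code from the GitHub repository: the embeddings are synthetic stand-ins for MiniLM output (the 384 dimensions, the item-generating recipe, and all function names are assumptions), and the mean within-scale similarity is printed as a rough signal indicator in place of the factor analysis step.

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Return the item-by-item cosine similarity matrix for row-wise embeddings."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / norms
    return unit @ unit.T

def replace_scales_with_noise(emb, scale_of_item, scales_to_destroy, rng):
    """Copy emb and overwrite the rows of the listed scales with Gaussian noise."""
    noisy = emb.copy()
    mask = np.isin(scale_of_item, scales_to_destroy)
    noisy[mask] = rng.standard_normal((mask.sum(), emb.shape[1]))
    return noisy

def mean_within_scale_similarity(sim, scale_of_item, scale):
    """Average off-diagonal similarity among the items of one scale."""
    idx = np.flatnonzero(scale_of_item == scale)
    block = sim[np.ix_(idx, idx)]
    off_diag = block[~np.eye(len(idx), dtype=bool)]
    return off_diag.mean()

# Hypothetical stand-in for real item embeddings: 5 scales x 10 items,
# each item = a shared scale direction plus item-specific noise.
rng = np.random.default_rng(0)
n_scales, n_items, dim = 5, 10, 384
scale_of_item = np.repeat(np.arange(n_scales), n_items)
centers = rng.standard_normal((n_scales, dim))
emb = centers[scale_of_item] + 0.5 * rng.standard_normal((n_scales * n_items, dim))

# Step 0 leaves every scale intact; step k has scales 0..k-1 replaced by noise.
for step in range(n_scales + 1):
    noisy = replace_scales_with_noise(emb, scale_of_item, list(range(step)), rng)
    sim = cosine_similarity_matrix(noisy)  # 50 x 50 matrix
    within = [mean_within_scale_similarity(sim, scale_of_item, s)
              for s in range(n_scales)]
    print(step, np.round(within, 2))
```

In a real run, the similarity matrix from each step would be passed to the pseudo factor analysis rather than summarized by within-scale averages.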

Dominant average absolute loadings

We can use the Dominant Average Absolute Loading (DAAL) to assign extracted factors to their theoretical counterparts. DAAL automatically assigns each extracted factor to whichever theoretical factor's items load on it most strongly, on average. A theoretical factor is considered “recovered” only if its mean absolute loading (its DAAL) is the highest on an extracted factor and that value exceeds a minimum threshold, or floor. A floor is needed because DAAL will always find a winning assignment, even in pure noise. The floor is what distinguishes a real factor assignment from a spurious one. We set the DAAL floor at .25. The value is matrix specific and was chosen because, when the matrix is pure noise, DAAL values stay at or below this level and so no factors are counted as recovered.
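One way to read the DAAL rule is sketched below. The function names, the return format, and the toy loading matrix are illustrative assumptions; only the logic of taking the highest mean absolute loading and applying a .25 floor comes from the description above.

```python
import numpy as np

def daal_assign(loadings, scale_of_item, floor=0.25):
    """Assign each extracted factor to a theoretical scale by mean absolute loading.

    loadings: (n_items, n_factors) pseudo loading matrix.
    scale_of_item: length n_items sequence of theoretical scale labels.
    Returns {factor index: (scale label or None, DAAL value)}.
    """
    abs_load = np.abs(np.asarray(loadings, dtype=float))
    scale_of_item = np.asarray(scale_of_item)
    scales = np.unique(scale_of_item)
    # Rows: theoretical scales; columns: extracted factors.
    mean_abs = np.vstack([abs_load[scale_of_item == s].mean(axis=0) for s in scales])
    result = {}
    for j in range(abs_load.shape[1]):
        best = int(np.argmax(mean_abs[:, j]))
        daal = float(mean_abs[best, j])
        # At or below the floor, the winning assignment is treated as spurious.
        label = scales[best].item() if daal > floor else None
        result[j] = (label, daal)
    return result

def count_recovered(assignments):
    """Number of distinct theoretical scales claimed by at least one factor."""
    return len({label for label, _ in assignments.values() if label is not None})

# Toy example: two three-item scales ("E" and "A") and two extracted factors.
toy_loadings = [[0.7, 0.10], [0.6, 0.00], [0.8, 0.10],
                [0.1, 0.20], [0.0, 0.15], [0.1, 0.20]]
toy_scales = ["E", "E", "E", "A", "A", "A"]
assignments = daal_assign(toy_loadings, toy_scales)
print(assignments)
print(count_recovered(assignments))
```

In the toy matrix the first factor is claimed by scale E, while the second factor's best match (scale A) falls below the floor and is left unassigned, so one factor counts as recovered.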

DAAL advantages over simple counting

Simple counting for factor assignment is difficult when interpreting pseudo loading matrices because they are often complex. Moreover, under simple counting a loading of .90 counts for no more than a loading of .30, so marginal cross-loadings can outvote a strong anchor item. The DAAL criterion weights by magnitude throughout, reflecting overall factor saturation. DAAL is also useful because factor analysis returns factors in an order that is arbitrary with respect to theory (more specifically, an order based on size), so you need an objective rule to match extracted factors to theoretical ones.
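A small worked example may make the contrast concrete. The loadings and the .30 salience cutoff used for counting are hypothetical, chosen only to show how the two rules can disagree.

```python
import numpy as np

# Hypothetical absolute loadings of one extracted factor on two three-item scales.
loadings = {
    "A": np.array([0.90, 0.10, 0.10]),  # one strong anchor item
    "B": np.array([0.30, 0.30, 0.30]),  # three marginal cross-loadings
}
salient = 0.30  # assumed cutoff for what counts as a salient loading

# Simple counting: every salient loading is one vote, whatever its size.
counts = {scale: int((vals >= salient).sum()) for scale, vals in loadings.items()}
# Magnitude weighting: mean absolute loading per scale, as in DAAL.
means = {scale: float(vals.mean()) for scale, vals in loadings.items()}

by_count = max(counts, key=counts.get)
by_daal = max(means, key=means.get)
print(counts, by_count)
print(means, by_daal)
```

Counting gives scale B three votes to scale A's one, whereas the mean absolute loading is about .37 for A and .30 for B, so the magnitude-weighted rule assigns the factor to A.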

Results

Results reveal that the factor structure collapses monotonically as noise progressively replaces semantic signal: the number of recovered factors falls from 5 to 4, 3, 2, 1, and finally 0. The similarity matrix panel plot shows the structure dissolving factor by factor as noise is added. In other words, we systematically destroyed the structure to prove it exists. We could also check that the destruction is invariant to the order in which scales are replaced with noise, and we could try randomizing the order of embedding dimensions as an alternative perturbation. The two perturbations differ: Gaussian noise degrades the distribution of embedding values, whereas dimension order randomization preserves the distribution but destroys the alignment between dimensions across embeddings. In this case, the results also showed monotonic degradation in recovery under both of these approaches.
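The dimension order randomization described above can be sketched as follows. The function name and the synthetic embeddings are assumptions for illustration. Each selected item gets its own independent permutation of dimensions, so its values (and its norm) are unchanged but no longer line up with those of other items.

```python
import numpy as np

def shuffle_dimensions(emb, rows, rng):
    """Independently permute the dimension order of the selected item embeddings."""
    shuffled = emb.copy()
    for i in rows:
        shuffled[i] = rng.permutation(emb[i])
    return shuffled

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in for ten items from one scale: shared direction plus noise.
rng = np.random.default_rng(0)
center = rng.standard_normal(384)
emb = center + 0.5 * rng.standard_normal((10, 384))

shuffled = shuffle_dimensions(emb, range(10), rng)

# Values within each item are preserved; similarity between items is not.
print(np.allclose(np.sort(emb[0]), np.sort(shuffled[0])))
print(round(cosine(emb[0], emb[1]), 2), round(cosine(shuffled[0], shuffled[1]), 2))
```

Applying the shuffle to one more scale at each step, in place of Gaussian noise, gives the alternative destruction schedule.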

image

Fine print

It is worth being precise about what this check does and does not show. In conventional Monte Carlo simulation, you compare the recovered model against a true model you generated, so you know whether the method recovers the true model, but we cannot yet do that here. In standard PFA, you compare against an empirical model from real respondents, which is important but does not rule out that the recovered structure is an artefact. Here we check whether the recovered structure depends on semantic signal, not whether the generating structure was recovered. These are all different things. A true Monte Carlo study for embedding psychometrics would require constraining the autoregressive generative process of the language model to produce text with a known factor structure. Until this circle is squared, negative-control robustness checks are a useful strategy.

Questions raised

The approach raises many questions, such as why we bother to build a bridge between machine learning and conventional measurement models rather than pursue altogether new machine learning approaches to structure recovery, or why we pursue a psychometrics based on learned representations rather than on simulated LLM responses. As we explore elsewhere in this book, the answer to the first question is interpretability and comparability. The answer to the second is that, while both are ultimately important, representations are closer to the underlying model than model-generated behavior is.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Google scholar profile for Nigel Guenole - AI psychometrics research
Linkedin profile for Nigel Guenole - AI assessment consulting and strategy