LLM Bias in Psychometrics: Measurement Invariance, Adverse Impact & Fairness

Invariance across languages and transformers, before data
Pseudo configural invariance
Pseudo metric invariance
Pseudo scalar invariance
Further considerations

Invariance across languages and transformers, before data

A natural extension of the pseudo factor analysis and pseudo factor analysis with intercepts work we presented earlier is checking measurement invariance, that is, whether scales function similarly in different contexts, before you collect any human response data. These developments mean you can potentially check whether your translations work the same way across two or more languages, or examine how close the geometry is across different AI models. Let’s try an experiment. You can get the code to recreate these outputs from GitHub.

Pseudo configural invariance

We take a short 8-item scale measuring eccentricity and translate the items into Spanish using Google Translate. We then fit a multiple-group model using pseudo factor analysis with means in R with lavaan. If the scale works equally well across languages, equating the loading and intercept parameters won’t materially change the item residuals in tests of loading and intercept invariance. The table and figures below show that constraining the loadings and the intercepts across all of the scale items doesn’t materially impact the residuals, as expected. This indicates that, semantically, the scales measure similarly across languages. In the tables, the met.ld columns hold item residuals under the metric (loading) constraint and the sca.in columns hold item residuals under the scalar (intercept) constraint, for English (EN) and Spanish (ES).

CORRECT - faithful Spanish items

Latent mean difference (ES - EN, standardised): -0.630

item	met.ld.EN	met.ld.ES	sca.in.EN	sca.in.ES
SCUP104	-0.21	-0.007	-0.009	0.006
SCCD71	-0.157	0.022	-0.084	0.07
SCUB69	-0.056	-0.055	0.103	-0.13
SCEC34	-0.136	0.009	0.053	-0.048
SCDS108	-0.159	0.085	-0.197	0.152
SCUP32	-0.17	0.018	0.077	-0.063
SCEC106	-0.042	-0.075	-0.1	0.13
SCDS72	-0.048	0.034	-0.119	0.111
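
To make the residual logic concrete, here is a minimal sketch in Python rather than the lavaan code used for the results above. It assumes a one-factor model in which the implied similarity between two items is the product of their loadings, and it summarises misfit as the mean off-diagonal residual per item. The function name, the toy cosine matrices, and the loadings are all hypothetical and are not the chapter's data or its exact residual definition.

```python
def item_residuals(cosines, loadings):
    """Mean off-diagonal residual per item: observed cosine minus
    the product of loadings implied by a one-factor model."""
    n = len(loadings)
    residuals = []
    for i in range(n):
        diffs = [cosines[i][j] - loadings[i] * loadings[j]
                 for j in range(n) if j != i]
        residuals.append(sum(diffs) / len(diffs))
    return residuals

# Hypothetical loadings held equal across both groups (a metric-style constraint).
shared_loadings = [0.8, 0.7, 0.6]

# Toy cosine matrices; illustrative numbers only.
cos_en = [[1.00, 0.56, 0.48],
          [0.56, 1.00, 0.42],
          [0.48, 0.42, 1.00]]
cos_es = [[1.00, 0.56, 0.18],
          [0.56, 1.00, 0.12],
          [0.18, 0.12, 1.00]]  # the third item relates weakly in this group

res_en = item_residuals(cos_en, shared_loadings)
res_es = item_residuals(cos_es, shared_loadings)
for k, (en, es) in enumerate(zip(res_en, res_es), start=1):
    print(f"item{k}: EN {en:+.3f}  ES {es:+.3f}")
```

In this toy case the shared loadings reproduce the English matrix, so its residuals are essentially zero, while the Spanish item that no longer relates to the others carries the largest residual.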

Pseudo metric invariance

Let’s corrupt the scale by making one item’s Spanish counterpart an item from a different scale. We will pair the English item “have bizarre experiences” with a Spanish translation of an antagonism item. The item “See how far I can push people” is translated to Spanish as “Vea hasta dónde puedo presionar a la gente” and used in the model.

That changes the construct-to-item relation across English and Spanish for this item pair, and it should be detectable as residual misfit under a metric invariance constraint. The starred row in the table below shows that it is. As a quick sense check, let’s confirm that the loading in the freely estimated model really is higher for the correct Spanish item. The loading for the correct item was .88, while the loading for the corrupted item was .21, so the experiment is working as expected.

LOADING CORRUPTION - SCUP32 replaced with wrong-construct Spanish item

Latent mean difference (ES - EN, standardised): -0.649

item	met.ld.EN	met.ld.ES	sca.in.EN	sca.in.ES
SCUP104	-.21	.01	-.02	.01
SCCD71	-.17	.04	-.09	.08
SCUB69	-.07	-.03	.10	-.12
SCEC34	-.17	.04	.05	-.04
SCDS108	-.17	.10	-.20	.15
SCUP32*	.04	-.37	.13	-1.57
SCEC106	-.05	-.05	-.10	.14
SCDS72	-.05	.03	-.12	.11
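
One simple way to read the two tables above side by side is to compute how far each item's Spanish metric-constraint residual moved between the faithful and the corrupted run. The sketch below does this in Python with the met.ld.ES values copied from those tables. The helper name is hypothetical, and ranking by absolute shift is an illustrative choice rather than a formal test.

```python
# Spanish metric-constraint residuals (met.ld.ES) copied from the two tables above.
baseline = {"SCUP104": -0.007, "SCCD71": 0.022, "SCUB69": -0.055, "SCEC34": 0.009,
            "SCDS108": 0.085, "SCUP32": 0.018, "SCEC106": -0.075, "SCDS72": 0.034}
corrupted = {"SCUP104": 0.01, "SCCD71": 0.04, "SCUB69": -0.03, "SCEC34": 0.04,
             "SCDS108": 0.10, "SCUP32": -0.37, "SCEC106": -0.05, "SCDS72": 0.03}

def residual_shifts(before, after):
    """Change in each item's residual between two runs (after minus before)."""
    return {item: after[item] - before[item] for item in before}

shifts = residual_shifts(baseline, corrupted)

# Rank items by how much their residual moved, largest absolute shift first.
ranked = sorted(shifts, key=lambda item: abs(shifts[item]), reverse=True)
most_shifted = ranked[0]
print(most_shifted, round(shifts[most_shifted], 3))
```

With these values the corrupted item, SCUP32, shows by far the largest shift, while the other items move by a few hundredths at most.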

Pseudo scalar invariance

Let’s put the correct scale back together and this time dial back the extremity of an item in Spanish while leaving it the same in English. We’ll change “Seem eccentric to other people” to the Spanish translation of “Seem original to other people”, which is “Parece original para otras personas”. That’s an item location change. We create the mean proxies by projecting item embeddings onto semantic intensity vectors, as described earlier in this book.
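
As a reminder of what such a projection involves, here is a minimal Python sketch. It assumes the intensity axis is the difference between a high-intensity and a low-intensity anchor embedding, which is one common construction and may differ from the one used earlier in the book. The three-dimensional vectors are toy values standing in for real encoder output, and all names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def intensity_axis(high_anchor, low_anchor):
    """Assumed axis: high-intensity anchor minus low-intensity anchor."""
    return [h - l for h, l in zip(high_anchor, low_anchor)]

def project(embedding, axis):
    """Scalar projection of an embedding onto the direction of the axis."""
    return dot(embedding, axis) / math.sqrt(dot(axis, axis))

# Toy embeddings; a real encoder would return hundreds of dimensions.
high_anchor = [0.9, 0.1, 0.2]
low_anchor = [0.1, 0.1, 0.2]
axis = intensity_axis(high_anchor, low_anchor)

strong_item = [0.7, 0.3, 0.1]  # stands in for a more extreme item wording
mild_item = [0.3, 0.3, 0.1]    # stands in for a dialled-back wording

print(project(strong_item, axis), project(mild_item, axis))
```

The more extreme wording lands further along the axis, which is the property the mean proxy relies on.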

This change should be detectable as residual misfit for that item pair in a scalar invariance test, while leaving the loading test residual mostly unaffected. The table below shows that it is. As a quick sense check, let’s confirm that the faithful Spanish translation, “Parece excéntrico a otras personas”, has a higher projection than the dialled-back translation, “Parece original para otras personas”. Inspection shows that it does: the original mean proxy was 1.51, while the second mean proxy was .66. There was a small latent mean difference across the groups here that I refrain from interpreting for now.

INTERCEPT CORRUPTION - SCEC34 replaced with lower-intensity Spanish item

Latent mean difference (ES - EN, standardised): -0.689

item	met.ld.EN	met.ld.ES	sca.in.EN	sca.in.ES
SCUP104	-.21	.00	-.03	.03
SCCD71	-.18	.04	-.11	.09
SCUB69	-.06	-.04	.09	-.11
SCEC34*	-.02	-.12	.27	-.65
SCDS108	-.16	.09	-.21	.16
SCUP32	-.19	.03	.05	-.04
SCEC106	-.05	-.06	-.11	.15
SCDS72	-.05	.04	-.13	.12

Further considerations

One reason to prefer pseudo projections onto semantic axes over ML mean prediction models is that the projection compares embeddings as whole structures rather than using arbitrarily weighted combinations of embedding dimensions. The full measurement framework then emerges from a unified embedding space. The predictive model, however, might still give an idea of the upper bound of what geometric methods can achieve. This upper bound may depend on encoder quality. In early trials, we found that encoder size had opposing effects on loading recovery compared with intercept recovery.

Many of the choices made here are not surfaced and required judgment, so future simulations are important to see how well the method works and to identify failure modes. For example, for intercepts, we used the English scale anchors for a common semantic intensity projection across languages. As well as running simulations, it would be possible to fit semantic models to scales with known empirical DIF and see whether this method catches those items. Unless you are checking that items are similarly represented across transformers, it makes sense to use the same model across languages.
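
The check against scales with known empirical DIF could be scored in a simple way: compare the items the semantic method flags with the items that empirical studies have flagged. The Python sketch below is one possible scoring under that assumption. The item labels are placeholders, not results, and the function name is hypothetical.

```python
def detection_summary(flagged, known_dif):
    """Compare items flagged by the semantic method with items known
    to show empirical DIF, returning hits, precision and recall."""
    flagged, known_dif = set(flagged), set(known_dif)
    hits = flagged & known_dif
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(known_dif) if known_dif else 0.0
    return {"hits": sorted(hits), "precision": precision, "recall": recall}

# Placeholder labels for illustration only.
known = ["item03", "item07"]    # items with documented empirical DIF
flagged = ["item03", "item05"]  # items flagged by residual misfit

summary = detection_summary(flagged, known)
print(summary)
```

Precision here answers how many flagged items were real problems, and recall answers how many real problems were caught, which are the two failure modes a validation study would want to separate.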

It is possible to use a different encoder for the cosine matrix and the mean vectors, and this may be attractive if you are only looking at item locations and different transformer models work better for cosine versus mean recovery. However, the trade-off is that doing so would break the claim that the loadings and intercepts emerge from different operations on the same embedding space, and it would make across-group comparisons difficult to interpret. Remember that we use residuals because N-based fit measures are not possible: there is no sample.

So it appears you can check scale translation bias with AI before collecting response data. Of course, this checks semantic relationships, not response processes. Psychometrics evolves slowly for good reasons: decisions based on psychometrics have big consequences. Keep in mind that this is an early-stage idea. The phrase “before response data” is used instead of “without data” because, to fully test for psychometric bias with humans, response data is always eventually needed.

Hand in hand with this acknowledgement, it is worth noting that the utility bar for this approach is not whether it is as effective as empirical DIF approaches, which are always necessary and which it doesn’t replace. The utility instead lies with practitioners who are quickly writing scales that otherwise won’t get evaluated before use. This is common with opinion surveys and 360-degree feedback in talent management. It may also be the case that the method detects non-invariance that is not picked up in empirical invariance tests.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
