- Psychometric bias analogies to LLMs
- Measurement invariance analogy
- Differential validity analogy
- Adverse impact analogy
- Reliability, validity and candidate experience including perceived fairness
- Relevant AI strategies for minimizing bias
- Preventative methods (Wang et al., 2023)
- Detection methods (Wang et al., 2023)
- Psychometric call to action
Psychometric bias analogies to LLMs
This section focuses on biases that reflect differences in model outcomes across subgroups, rather than biases due to model misspecification or optimization affecting token probability estimation uniformly across all data. While estimation biases concern a model’s internal statistical properties, the outcome biases we discuss here occur as systematic differences across demographic subgroups and raise fairness concerns, even where parameters are well-estimated.
Among the concerns that psychometric practitioners raise when presented with the opportunity to use LLMs is ensuring that the training data and model performance do not discriminate against minority groups. What "discriminate" means is often not clearly defined in these objections, partly because the forms of bias in LLMs relevant to psychometrics are still being established.
However, one possibility is that while new precursors or causes of bias may emerge from the use of AI due to the way architectural decisions impact user interactions across groups, the ways these biases will manifest can be framed in terms of conventional bias frameworks. To begin thinking about this topic, we propose considering the metaphorical parallels between traditional psychometric notions of bias and adverse impact and analogous possibilities with LLMs.
We acknowledge that, even if these analogies prove useful for thinking about the biases that can emerge when LLMs are used for psychological assessment, the methods needed to detect those biases may well differ from the methods used to detect their classical psychometric counterparts.
Measurement invariance analogy
Measurement non-invariance involves an assessment or survey item measuring differently for a referent (i.e., majority) and focal (i.e., minority) group. An LLM parallel might be that a model encodes information or responds differently to a prompt because of task-irrelevant content reflective of protected characteristics. Irrelevant information may be revealed in the prompt or inferred from the language and reported experiences of majority and minority groups.
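As a loose illustration of how this analogy might be probed, the sketch below scores matched responses that differ only in a group-linked but task-irrelevant cue and compares the resulting means. The score_with_llm function is a hypothetical placeholder for whatever representational or generative scoring pipeline is actually in use, and the example responses are invented.

```python
# Sketch: probe an LLM scoring pipeline for something analogous to measurement
# non-invariance by scoring matched responses that differ only in a
# task-irrelevant, group-linked cue. score_with_llm is a placeholder.
from statistics import mean

def score_with_llm(response_text: str) -> float:
    """Placeholder: return a construct score for a constructed response."""
    return min(5.0, len(response_text.split()) / 2)  # stand-in for a real model call

matched_pairs = [
    # (referent-group phrasing, focal-group phrasing) of the same answer content
    ("I led my church youth group through a fundraising project.",
     "I led my mosque youth group through a fundraising project."),
    ("As captain of the lacrosse team, I organized weekly practice.",
     "As captain of the cricket team, I organized weekly practice."),
]

referent_scores = [score_with_llm(ref) for ref, _ in matched_pairs]
focal_scores = [score_with_llm(foc) for _, foc in matched_pairs]

# If the only difference between pair members is group-linked but task-irrelevant
# content, a systematic score gap is analogous to measurement non-invariance.
print("Mean referent score:", mean(referent_scores))
print("Mean focal score:   ", mean(focal_scores))
print("Gap:", mean(referent_scores) - mean(focal_scores))
```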
Differential validity analogy
Differential validity refers to different correlations or regression coefficients between a predictor and a criterion for a majority and a minority group. A parallel with LLMs might be that LLM encodings of constructed-response tests used in scoring models, in the case of representational AI, or LLM-generated scores themselves, in the case of generative AI, relate differently to performance outcomes across majority and minority groups.
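A minimal sketch of how the corresponding check might be run follows, assuming numpy, pandas and statsmodels are available; the data are synthetic and the column names are illustrative. A moderated regression with a group-by-predictor interaction is one conventional way to test whether LLM-derived scores predict a criterion differently across groups.

```python
# Sketch: test for differential prediction of a criterion from LLM-derived scores
# using a group x predictor interaction term. The data below are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)            # 0 = referent group, 1 = focal group
llm_score = rng.normal(0, 1, n)          # LLM-derived predictor (e.g., interview score)
# Criterion generated with a weaker slope for the focal group, for illustration.
criterion = 0.5 * llm_score - 0.3 * group * llm_score + rng.normal(0, 1, n)

df = pd.DataFrame({"criterion": criterion, "llm_score": llm_score, "group": group})
model = smf.ols("criterion ~ llm_score * group", data=df).fit()

# A significant llm_score:group coefficient indicates different slopes across
# groups, the regression analogue of differential validity.
print(model.summary().tables[1])
```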
Adverse impact analogy
Adverse impact in psychometrics refers to differences in selection rates due to the use of an assessment instrument, irrespective of whether differential item functioning (DIF) is present. A parallel from LLMs is different mean scores across groups on an automated interview, either where representational encodings are used to predict human scores or, alternatively, where generative AI is used to score responses against a rubric.
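Because this analogy maps directly onto familiar selection-rate statistics, the same checks apply. A minimal sketch of a four-fifths (80%) rule calculation follows; the counts and group labels are invented for illustration.

```python
# Sketch: four-fifths (80%) rule check on selection rates from an automated
# interview. The counts are illustrative, not real data.
selected = {"referent": 120, "focal": 35}
applicants = {"referent": 300, "focal": 140}

rates = {g: selected[g] / applicants[g] for g in selected}
impact_ratio = rates["focal"] / rates["referent"]

print(f"Referent selection rate: {rates['referent']:.2f}")
print(f"Focal selection rate:    {rates['focal']:.2f}")
print(f"Impact ratio: {impact_ratio:.2f} "
      f"({'below' if impact_ratio < 0.8 else 'meets'} the four-fifths threshold)")
```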
Reliability, validity and candidate experience including perceived fairness
Along with these concerns that relate specifically to bias in LLMs, psychometricians are equally concerned with the impact that choices about training data have on key psychometric criteria, including forms of reliability, forms of validity and, increasingly, candidate experiences including perceptions of fairness. However, in our conversations with practitioners, the question that tends to arise first when it comes to AI in assessment relates to the idea of bias.
Relevant AI strategies for minimizing bias
Fortunately, data management strategies used in LLM training are a core focus of AI researchers, and many steps are taken that may mitigate bias concerns. For example, methods that promote data representativeness can help ensure that test-taker populations are represented in training data. These efforts are part of broader work on AI alignment, a field that aims to ensure that AI model behavior is consistent with human norms and values. The data strategy methods that can prevent bias may broadly be categorized as preventative methods (i.e., steps taken in pre-processing or training before model release) and detection methods (i.e., monitoring and intervening during operational use).
Preventative methods (Wang et al., 2023)
Advanced methods exist to ensure that data included in model training are representative of broad populations. Model developers examine the effect of different training sources and the impact of mixing proportions on model performance. Efforts are made to remove duplicated data, which can degrade performance. Toxicity filtering is applied to remove toxic content that would make a person want to stop a conversation. Quality filtering examines perplexity, or how well the model predicts text on held-out data.
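To make these steps concrete, the sketch below strings together exact deduplication, a toxicity filter, and a perplexity-based quality filter over a toy corpus. The toxicity_score and perplexity functions are placeholders standing in for real scoring models, and the thresholds are arbitrary; this illustrates the shape of such a pipeline rather than any developer's actual implementation.

```python
# Sketch of pre-processing filters of the kind described above: exact
# deduplication, toxicity filtering, and perplexity-based quality filtering.
# toxicity_score and perplexity are placeholders for real scoring models.
import hashlib

def toxicity_score(text: str) -> float:
    """Placeholder for a toxicity classifier returning a 0-1 score."""
    return 0.9 if "abusive" in text.lower() else 0.1

def perplexity(text: str) -> float:
    """Placeholder for perplexity under a reference language model."""
    return 20.0 if len(text.split()) > 3 else 500.0

def filter_corpus(docs, tox_threshold=0.5, ppl_threshold=100.0):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:                       # duplicate removal
            continue
        seen.add(digest)
        if toxicity_score(doc) > tox_threshold:  # toxicity filtering
            continue
        if perplexity(doc) > ppl_threshold:      # quality filtering
            continue
        kept.append(doc)
    return kept

corpus = [
    "Tell me about a time you resolved a conflict at work.",
    "Tell me about a time you resolved a conflict at work.",  # exact duplicate
    "abusive rant",                                           # toxic, low quality
    "I coordinated the project schedule across two teams.",
]
print(filter_corpus(corpus))
```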
LLM systems also rely on post-processing methods that could minimize the examples of bias we described. These include fine-tuning models on domain-specific data, reinforcement learning from human feedback (RLHF), and debiasing methods, all of which adjust the model prior to release. These steps are all targeted at biases that occur prior to or during training. Guo et al. (2024) have referred to these biases as intrinsic biases.
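As one concrete example of a debiasing adjustment of this kind, counterfactual data augmentation duplicates fine-tuning examples with group-linked terms swapped so that the model sees more balanced evidence. The sketch below uses an invented term map and invented examples, and is not the procedure used by any particular developer.

```python
# Sketch: counterfactual data augmentation for a fine-tuning set, one of several
# debiasing adjustments applied before model release. The naive token-level
# swap and the term pairs below are illustrative only.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def counterfactual(text: str) -> str:
    """Return the text with group-linked terms swapped, token by token."""
    return " ".join(SWAPS.get(tok, tok) for tok in text.split())

fine_tune_examples = [
    "she described how her team handled the deadline",
    "he explained his approach to resolving the dispute",
]

# Augment the original examples with their counterfactual counterparts.
augmented = fine_tune_examples + [counterfactual(t) for t in fine_tune_examples]
for example in augmented:
    print(example)
```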
Detection methods (Wang et al., 2023)
Operational guardrails can also be applied after deployment. These include content filters and moderation layers that detect problems and control outputs in real time without altering the underlying model. Guardrails target biases that play out during LLM operation, which Guo et al. (2024) have referred to as extrinsic biases.
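A minimal sketch of such a moderation layer follows, wrapping a deployed model without touching its weights. Both call_llm and flags_problem are hypothetical placeholders for the deployed model and whatever detector is used in practice.

```python
# Sketch: a post-deployment guardrail that screens model outputs in real time
# without altering the underlying model. Both functions below are placeholders.
def call_llm(prompt: str) -> str:
    """Placeholder for a call to the deployed model."""
    return f"Scored response for: {prompt}"

def flags_problem(text: str) -> bool:
    """Placeholder for a content filter / bias or toxicity detector."""
    return "protected characteristic" in text.lower()

def guarded_generate(prompt: str) -> str:
    output = call_llm(prompt)
    if flags_problem(output):
        # Intervene at the output layer: block, redact, or route for human review.
        return "[response withheld for review]"
    return output

print(guarded_generate("Rate this interview answer against the rubric."))
```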
Psychometric call to action
It is not clear that the data management strategies used by LLM developers will meet the requirements of psychometric users of LLMs with respect to bias. There are also AI research findings that psychometric users might find counter-intuitive. On certain measures of model performance, including training content that is not directly relevant to the domain in which the model will be used improves performance. This suggests a point-to-point correspondence between training data and a testing population may not be required or even desirable.
Psychometric researchers are, to a large degree, consumers of LLMs, applying conventional bias detection methods to LLM outputs (with the exception of early work by psychologists on model fine-tuning). To ensure that LLMs produce embeddings maximally suited to psychometric tasks and generate content appropriate for psychological assessment, psychologists and psychometricians need to work closely with the computer scientists developing the prevention and detection methods discussed in this section.