- Faithfulness and Plausibility
- Faithfulness
- Comprehensiveness
- SHAP values and variations
- Local Interpretable Model-agnostic Explanations (LIME)
- Limitations of perturbation based methods
- Attention-based attribution methods
- Plausibility
- Similarity to subject matter experts (SIMSE)
- Choosing a method: Practical decision framework
- Summary
- References
Faithfulness and Plausibility
In this section I describe methods from explainable artificial intelligence (XAI) for understanding how different aspects of a prompt contribute to a language model’s output. I discuss how these methods can be used to understand what determined model outputs, whether those outputs are encoder embeddings, decoder estimates of item difficulty or discrimination, or scores for a constructed-response SJT or interview question.
These XAI methods should complement traditional psychometric validation when using language models in psychometrics. Along with demonstrating reliability and validity and checking impact, we must establish what the model is actually weighting most heavily. An LLM might produce scores with strong psychometric properties while weighting construct-irrelevant features. Only when we know the computational basis for scores can we meaningfully interpret traditional validation evidence.
I also introduce a proof of concept with a single example: scoring a board-level executive’s ability to ensure cohesion in the board from an interview response. This worked example is intended as an illustrative case study. It reveals both the potential of XAI methods to decipher AI scoring processes and the considerable work required to make these approaches interpretable for psychometric practice. The Python notebook for the demonstration accompanies the sections that follow.
Faithfulness
Faithfulness in XAI describes how accurately our explanation captures the model’s actual computational process (Jacovi & Goldberg, 2020), distinct from plausibility which captures alignment with human intuition.
Faithfulness can be categorised into model-based faithfulness and model-agnostic faithfulness. Model-based faithfulness describes precisely how the model produces its outputs, whereas model-agnostic faithfulness captures what the model does at a higher level of abstraction by examining only inputs and outputs.
While model-based faithfulness is more plausibly achievable for encoder embeddings, for decoder-only LLMs (e.g., GPT, Llama) existing methods do not provide complete model-based faithfulness because of token dependencies introduced by autoregressive generation. Removal or modification of a single token can alter the meaning of all tokens that follow. Nonetheless, model-agnostic faithfulness investigations are still widely and productively used in applied practice with both encoder and decoder models.
Common metrics for faithfulness are often based on perturbation methods. Below I provide such an example: a perturbation-based investigation of model-agnostic faithfulness for a decoder model. I will show the general principles using an approach that checks which words were most influential in determining an LLM’s score of an executive’s ability to drive cohesion and alignment at board level. Consider the following response by an executive to an interview probe.
“When I became chair, the board was fragmented after a failed acquisition. Directors avoided open disagreement, and decisions were drifting into informal groups. I met individually with each member to surface frustrations, then introduced a transparent record of board resolutions so rationales and follow-ups were visible to all.”
Comprehensiveness
We give a prompt to an LLM to score this passage, such as “On a scale from 0 to 100, predict the writer’s capability on the dimension ‘Driving board cohesion and governance’ based on the response given which describes how they handled a major board-level leadership challenge.” The notebook gives a more detailed prompt and a brief rubric for the LLM to score against.
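As a concrete illustration, a minimal sketch of such a scoring call is shown below. It assumes an OpenAI-style chat API; the exact prompt wording, the instruction to reply with a number only, and the score_response helper are illustrative stand-ins for the notebook’s fuller prompt and rubric, not the actual implementation.

```python
# Minimal sketch of scoring one response with an OpenAI-style chat API.
# The prompt wording and score_response helper are illustrative stand-ins,
# not the notebook's actual prompt and rubric.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RESPONSE = (
    "When I became chair, the board was fragmented after a failed acquisition. "
    "Directors avoided open disagreement, and decisions were drifting into informal groups. "
    "I met individually with each member to surface frustrations, then introduced a "
    "transparent record of board resolutions so rationales and follow-ups were visible to all."
)

def score_response(text: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the LLM for a 0-100 capability score for one response and parse the number."""
    prompt = (
        "On a scale from 0 to 100, predict the writer's capability on the dimension "
        "'Driving board cohesion and governance' based on the response given, which describes "
        "how they handled a major board-level leadership challenge. Reply with a number only.\n\n"
        f"Response: {text}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = completion.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else float("nan")

print(score_response(RESPONSE))
```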
In response to these instructions, the LLM will give back a score on a scale from 0 to 100, but how can we work out which words in the passage, or combinations of words, were most influential in determining that score? This question gets to the core of faithfulness in XAI and is highly relevant to psychometrics.
One way to check this is to have the LLM score perturbed versions of the executive’s response that omit certain words and to check how the score changes. Bigger score reductions for certain words suggest the score was determined more by those words, while a dramatic drop when the top-k tokens are removed offers a preliminary heuristic that the score is somewhat explainable via prominent tokens.
Complementary to comprehensiveness, which examines the score reduction caused by omitting the top-k tokens, is the idea of sufficiency. Sufficiency asks instead whether the score stays the same, or similar, if we retain only the tokens (or words) that the perturbation identified as important.
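A minimal leave-one-word-out sketch of both checks follows. It stubs score_response so the snippet runs standalone (in practice it would call the LLM as sketched above), and the comprehensiveness and sufficiency calculations shown are one simple way to operationalise the definitions, not the notebook’s exact implementation.

```python
# Leave-one-word-out perturbation: how much does the score move when each word is removed?
# score_response is stubbed here so the sketch runs standalone; in practice it would
# call the LLM as sketched above.

def score_response(text: str) -> float:
    return 50.0  # placeholder for an actual LLM scoring call

RESPONSE = ("When I became chair, the board was fragmented after a failed acquisition. "
            "Directors avoided open disagreement, and decisions were drifting into informal groups. "
            "I met individually with each member to surface frustrations, then introduced a "
            "transparent record of board resolutions so rationales and follow-ups were visible to all.")

words = RESPONSE.split()
base_score = score_response(RESPONSE)

# Importance of word i = score drop when word i is omitted (positive = the word helped the score).
importance = []
for i in range(len(words)):
    perturbed = " ".join(words[:i] + words[i + 1:])
    importance.append(base_score - score_response(perturbed))

# Comprehensiveness: score drop when the top-k most helpful words are all removed at once.
k = 5
top_k = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)[:k]
without_top_k = " ".join(w for i, w in enumerate(words) if i not in top_k)
comprehensiveness = base_score - score_response(without_top_k)

# Sufficiency: how close the score stays when only the top-k words are retained
# (values near zero suggest those words are sufficient to reproduce the score).
only_top_k = " ".join(w for i, w in enumerate(words) if i in top_k)
sufficiency = base_score - score_response(only_top_k)

print(f"comprehensiveness@{k}: {comprehensiveness:.1f}, sufficiency@{k}: {sufficiency:.1f}")
```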
Example attribution results (from a single GPT-3.5 Turbo run) are summarised below. The model seems to reward concrete terms such as member and failed acquisition (perhaps indicating experience) but penalises incompleteness (work to do) and fragmentation (individually, fragmented). Words that are positive in isolation can be harmful in negative contexts: ‘board’ and ‘met’ are penalised because of associations such as “fragmented board” and “met individually”.
When ‘I’ appears in action-driven phrases such as “I became chair” or “I introduced”, it is rewarded. Where the self is implicated in a negative context, e.g., “I met … individually”, it is penalised. ‘Transparent’ scores negatively, perhaps suggesting that the model favours outcomes over processes, or that ‘transparent’ appeared in negative contexts in the model’s training data.
| Helpful words | Impact | Harmful words | Impact |
| --- | --- | --- | --- |
| member | 10.00 | work | -11.70 |
| failed | 8.30 | board | -10.00 |
| to | 6.70 | met | -10.00 |
| after | 6.70 | transparent | -8.30 |
| to | 5.00 | individually | -8.30 |
| rationales | 5.00 | When | -6.70 |
| of | 5.00 | a | -5.00 |
| to | 5.00 | say | -5.00 |
| chair, | 5.00 | were | -5.00 |
| I | 5.00 | I | -5.00 |
| with | 3.30 | so | -5.00 |
| acquisition. | 3.30 | resolutions | -5.00 |
| became | 3.30 | do. | -5.00 |
| follow-ups | 1.70 | record | -5.00 |
| surface | 1.70 | fragmented | -5.00 |
These findings are illustrative only and the example requires caveats. Operational use would require replication across diverse responses, validation against expert judgments of feature importance (see the section on SIMSE below) and testing for consistency across different prompts and model versions.
SHAP values and variations
In practice, while these perturbation methods work for black-box models, more powerful methods are available when we have access to the model’s internal workings. Note also that the example shared is a word-level example. Word-level comprehensiveness is often used for its correspondence with human-level thinking (higher plausibility).
However, language models operate on tokens, which are sub-word units, so word-level approaches sacrifice faithfulness and are closer to behavioural than mechanistic faithfulness (Doshi-Velez & Kim, 2017). Higher faithfulness can be obtained with token-level perturbation methods, but human interpretability may be lower if the tokenization is unintuitive.
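To illustrate the word/token mismatch, the short snippet below uses a Hugging Face GPT-2 tokenizer as a stand-in for the scoring model’s tokenizer (an assumption for illustration only).

```python
# Sub-word tokenization illustration; GPT-2's tokenizer stands in for the scoring model's.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("the board was fragmented after a failed acquisition"))
# Less common words may be split into several sub-word pieces, so token-level
# attributions do not map one-to-one onto the words a human reader sees.
```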
Additionally, the perturbation approach shown leaves one word (or token) out of the scored passage at every forward pass (run through the model), effectively evaluating each word in one context only: that of all the remaining words. It is therefore a heuristic approach. Some words may be more influential when included alongside certain other words, so ideally we would examine the impact of each word in all possible contexts. When we average the effect of a word over every possible context and weight each context using the Shapley weights (Shapley, 1953), we obtain the SHAP values for comprehensiveness.
In practice, exhaustive calculation of SHAP values is not computationally feasible, so a sampling approach called KernelSHAP is used (Lundberg & Lee, 2017). In this approach, the results of the token sampling process (i.e., token coalitions) are structured in a matrix of binary membership variables indicating whether each token was included in each sampled coalition. The samples are weighted by the Shapley kernel, which places the greatest weight on very small and very large coalitions, because these are the most informative about individual tokens’ marginal contributions. LLM scores are then regressed on the binary membership variables, and the regression coefficients approximate the SHAP values.
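The sketch below illustrates the idea under simplifying assumptions: coalitions are sampled uniformly by size, the Shapley kernel is applied as a regression weight, and the paired sampling and efficiency constraint of the full KernelSHAP algorithm are omitted; score_response is stubbed so the snippet runs standalone.

```python
# KernelSHAP-style sketch: sample token coalitions, weight them with the Shapley kernel,
# and regress LLM scores on coalition membership. score_response is stubbed so the
# snippet runs standalone; words stand in for tokens for readability.
from math import comb
import numpy as np
from sklearn.linear_model import LinearRegression

def score_response(text: str) -> float:
    return 50.0  # placeholder for an actual LLM scoring call

words = "I met individually with each member to surface frustrations".split()
M = len(words)
rng = np.random.default_rng(0)

n_samples = 200
Z = np.zeros((n_samples, M), dtype=int)   # binary coalition membership matrix
y = np.zeros(n_samples)                   # LLM score for each coalition
w = np.zeros(n_samples)                   # Shapley kernel weight for each coalition

for s in range(n_samples):
    size = rng.integers(1, M)             # coalition sizes 1 .. M-1 (avoid degenerate weights)
    members = rng.choice(M, size=size, replace=False)
    Z[s, members] = 1
    y[s] = score_response(" ".join(words[i] for i in members))
    # Shapley kernel: (M-1) / (C(M, |z|) * |z| * (M - |z|)), largest for very small
    # and very large coalitions.
    w[s] = (M - 1) / (comb(M, size) * size * (M - size))

reg = LinearRegression().fit(Z, y, sample_weight=w)
for word, value in zip(words, reg.coef_):     # coefficients approximate per-word SHAP values
    print(f"{word:>14s}: {value:+.2f}")
```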
The perturbation-based comprehensiveness metrics discussed so far are approximate and can be used whether or not model internals are accessible. When models are open source, however, we can also access gradients backpropagated from the output to the inputs. Integrated Gradients (Sundararajan et al., 2017) accumulates these gradients over multiple interpolation steps between a baseline and the input for higher accuracy, while GradientSHAP is a faster approximation that multiplies gradients by (input - baseline) differences, averaging over multiple sampled baselines.
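A minimal Integrated Gradients sketch for an open-source model is shown below; distilbert-base-uncased with a freshly initialised one-unit regression head is a placeholder for whatever fine-tuned open scoring model is actually in use, and the zero-embedding baseline and 20 interpolation steps are illustrative choices.

```python
# Integrated Gradients sketch for an open-source scoring model (token-level attributions).
# The model name, untrained one-unit head, zero baseline, and step count are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"   # placeholder for a fine-tuned open scoring model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

text = "I met individually with each member to surface frustrations"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    embeds = model.get_input_embeddings()(enc["input_ids"])   # (1, seq_len, hidden)
baseline = torch.zeros_like(embeds)                           # all-zeros baseline embedding

steps = 20
total_grads = torch.zeros_like(embeds)
for alpha in torch.linspace(0.0, 1.0, steps):
    # Interpolate between baseline and input, then backpropagate the score to the embeddings.
    interp = (baseline + alpha * (embeds - baseline)).clone().requires_grad_(True)
    score = model(inputs_embeds=interp, attention_mask=enc["attention_mask"]).logits.squeeze()
    score.backward()
    total_grads += interp.grad

# IG attribution = (input - baseline) * average gradient along the path, summed over hidden dims.
attributions = ((embeds - baseline) * total_grads / steps).sum(dim=-1).squeeze(0)
for token, value in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), attributions.tolist()):
    print(f"{token:>15s}: {value:+.4f}")
```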
Local Interpretable Model-agnostic Explanations (LIME)
LIME (Local Interpretable Model-agnostic Explanations; Ribeiro et al., 2016) also uses perturbation but weights samples by proximity to the original input rather than by coalition size. The Ribeiro et al. (2016) method is as follows:
- Generate perturbed samples by randomly masking tokens from the original input.
- Calculate the model score for each perturbed sample.
- Weight each sample by its similarity to the original input (fewer tokens removed = higher weight).
- Fit a weighted linear regression of the model scores on binary token indicators (1 = present, 0 = absent).
- The regression coefficients are the token importance values.
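A minimal sketch of these steps follows, with score_response stubbed so the snippet runs standalone; the exponential proximity kernel and its width are illustrative choices rather than the exact kernel used in the LIME package.

```python
# LIME-style sketch following the steps above: random word masking, proximity weighting,
# and a weighted linear surrogate model. score_response is stubbed so the snippet runs.
import numpy as np
from sklearn.linear_model import Ridge

def score_response(text: str) -> float:
    return 50.0  # placeholder for an actual LLM scoring call

words = "I met individually with each member to surface frustrations".split()
M = len(words)
rng = np.random.default_rng(0)

n_samples = 200
Z = rng.integers(0, 2, size=(n_samples, M))          # 1 = word kept, 0 = word masked
Z[0] = 1                                             # include the unperturbed input
y = np.array([score_response(" ".join(w for w, keep in zip(words, z) if keep)) for z in Z])

# Proximity weight: samples closer to the original (fewer words removed) get more weight.
distance = 1.0 - Z.mean(axis=1)                      # fraction of words removed
weights = np.exp(-(distance ** 2) / 0.25)            # exponential kernel, illustrative width

surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
for word, coef in zip(words, surrogate.coef_):       # coefficients = local word importances
    print(f"{word:>14s}: {coef:+.2f}")
```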
Limitations of perturbation based methods
All perturbation methods assume feature independence and may miss interaction effects. SHAP assumes output additivity, which can fail for highly non-linear models. LIME's locality assumption may not hold if the model's behaviour changes sharply within the local neighbourhood of the input.
Attention-based attribution methods
Attention-based attribution methods extract the attention weight matrices from transformer layers. The simplest approach averages final-layer attention weights across attention heads. The Attention Rollout method (Abnar & Zuidema, 2020) recursively multiplies the per-layer attention matrices to trace how attention flows from the output back to the input tokens. Importantly, high attention doesn't guarantee causal importance (Jain & Wallace, 2019), so these methods should be combined with the perturbation methods above. Attention-based methods also require open-source models where internal weights are accessible.
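A minimal attention rollout sketch for an open model follows; distilbert-base-uncased is a placeholder for the open scoring model in use, and reading off the [CLS] row as an importance signal is a common convention rather than part of the original method.

```python
# Attention rollout sketch (Abnar & Zuidema, 2020) for an open-source transformer;
# the model name is a placeholder for whatever open scoring model is in use.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "distilbert-base-uncased"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "I met individually with each member to surface frustrations"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions          # tuple of (1, heads, seq, seq), one per layer

rollout = torch.eye(attentions[0].shape[-1])
for layer_attn in attentions:
    attn = layer_attn.mean(dim=1)[0]              # average over heads -> (seq, seq)
    attn = attn + torch.eye(attn.shape[-1])       # add identity for residual connections
    attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalise rows
    rollout = attn @ rollout                      # recursively combine layers

# Attention flowing from the [CLS] position to each token, as a crude importance signal.
cls_flow = rollout[0]
for token, value in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), cls_flow.tolist()):
    print(f"{token:>15s}: {value:.3f}")
```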
Plausibility
While comprehensiveness tells us how accurately an explanation captures the way the model produces scores, it does not tell us how intuitive that explanation is to human evaluators. The intuitiveness of an explanation to humans is called plausibility.
Similarity to subject matter experts (SIMSE)
A metric for examining plausibility is similarity to subject matter experts (commonly evaluated using rank correlation between model attributions and expert annotations; DeYoung et al., 2019). The process involves calculating attribution scores for all tokens using one of the chosen methods (SHAP, LIME, perturbation, etc.). Next, subject matter experts manually label or rank tokens by importance for the prediction. I spend less time on this method because its implementation is straightforward once a comprehensiveness metric is available.
The experts’ importance ratings for each token can then be compared to the automatically calculated attribution scores using similarity metrics (Spearman’s ρ, Pearson’s r) or ranking overlap measures (top-k precision, Kendall’s τ). Higher correlations indicate that the automated method captures human expert intuition. This validates whether computational attributions align with domain expertise and human reasoning about feature importance.
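A minimal sketch of the comparison follows; the attribution values and expert ratings below are illustrative placeholders, not real data.

```python
# SIMSE-style plausibility check: rank-correlate model attributions with expert importance
# ratings. The numbers here are illustrative placeholders, not real expert data.
from scipy.stats import spearmanr, kendalltau

words          = ["I", "met", "individually", "with", "each", "member", "to", "surface", "frustrations"]
model_attrib   = [5.0, -10.0, -8.3, 3.3, 0.0, 10.0, 5.0, 1.7, 0.0]   # e.g. perturbation scores
expert_ratings = [2, 3, 4, 1, 1, 5, 1, 4, 3]                          # e.g. 1-5 importance ratings

rho, _ = spearmanr(model_attrib, expert_ratings)
tau, _ = kendalltau(model_attrib, expert_ratings)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")

# Top-k precision: overlap between the k words the model and the experts rank highest.
k = 3
top_model  = set(sorted(range(len(words)), key=lambda i: abs(model_attrib[i]), reverse=True)[:k])
top_expert = set(sorted(range(len(words)), key=lambda i: expert_ratings[i], reverse=True)[:k])
print(f"top-{k} precision = {len(top_model & top_expert) / k:.2f}")
```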
Choosing a method: Practical decision framework
Selecting the right method depends on three questions. Can you access the model's internals? If yes (open-source models), use gradient methods; if no (API access only), use perturbation. How much compute can you afford? SHAP is most accurate but expensive and can be saved for final validation; simple perturbation is faster for development. Do stakeholders need to understand your explanations? If so, validate against human expert judgements using plausibility metrics to show that the explanations are reasonable to humans.
Summary
New methods for explaining AI behaviour are required in psychometrics. The need for these new methods may be surprising because of the implications it carries. The first is that conventional criteria must be augmented for AI. With human ratings, inter-rater consistency makes us comfortable taking judgements at face value, and in CTT and IRT it is clear how input changes lead to output changes. With AI we cannot easily trace model processes, raising the uneasy possibility that we do not know the reason for AI scores.
Some of the methods described are summary explanations based on sampling techniques, or are designed for closed-source models where we do not have access to model weights. Their use implies some level of acceptance of black-box scoring in psychometrics. In practice, AI methods do necessitate new explainability methods, and these are being used in AI psychometrics.
References
Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928.
DeYoung, J., Jain, S., Rajani, N. F., Lehman, E., Xiong, C., Socher, R., & Wallace, B. C. (2019). ERASER: A benchmark to evaluate rationalized NLP models. arXiv preprint arXiv:1911.03429.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?. arXiv preprint arXiv:2004.03685.
Jain, S., & Wallace, B. C. (2019). Attention is not explanation. arXiv preprint arXiv:1902.10186.
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144).
Shapley, L. S. (1953). A value for n-person games. In H. Kuhn & A. Tucker (Eds.), Contributions to the Theory of Games II (pp. 307-317). Princeton University Press.
Sundararajan, M., Taly, A., & Yan, Q. (2017, July). Axiomatic attribution for deep networks. In International conference on machine learning (pp. 3319-3328). PMLR.