- A healthy preoccupation
- Possibilities and challenges of embeddings
- Absence of person sample
- Reasons to cautiously proceed
- Similarities for correlations
- Pseudo alpha
- Pseudo omega
- Pseudo omega hierarchical
- Further reflections on Omega
- Item level diagnostics
- Conclusion
- References
A healthy preoccupation
Psychometrics has a rightful preoccupation with reliability. Rightful because meaningful interpretation of a measurement instrument requires that its readings do not fluctuate arbitrarily, out of step with the measurement target. Three downstream consequences of this epistemic position permeate applied practice. Estimates of reliability permit estimates of measurement precision via the standard error of measurement; they allow us to correct observed relationships for measurement error; and the reliability of a measurement method gives an upper bound on the validity coefficients that can be observed with that method.
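For readers who want the three uses in concrete form, the following is a minimal sketch of the classical test theory formulas behind them. The function names and the example values are illustrative assumptions, not quantities from this chapter.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Disattenuated correlation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

def max_observable_correlation(rel_x, rel_y=1.0):
    """Upper bound on an observed correlation: sqrt(rel_x * rel_y)."""
    return math.sqrt(rel_x * rel_y)

# Hypothetical values chosen only to illustrate the arithmetic.
sem = standard_error_of_measurement(sd=10.0, reliability=0.84)
corrected = correct_for_attenuation(r_xy=0.30, rel_x=0.80, rel_y=0.70)
bound = max_observable_correlation(rel_x=0.81)

print(round(sem, 3), round(corrected, 3), round(bound, 3))
```

All three functions take a reliability coefficient as input, which is why the interpretation of that coefficient matters so much in what follows.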
Possibilities and challenges of embeddings
The psychometrics of learned representations does not begin with a sample of respondents providing observed item responses. This has implications for our interpretation of reliability that remain largely unexplored. Here we outline several issues that arise. We argue that the conventional interpretation of reliability, as the proportion of observed-score variance attributable to true-score variance rather than measurement error, no longer applies, even though the same calculations can be performed on similarity matrices as on correlation matrices. Familiar calculations on a cosine similarity matrix therefore do not yield reliability coefficients in the conventional sense. The resulting quantities are analogous to some degree, but because the inputs differ they carry a different interpretation and may instead be read as indices of semantic cohesion.
To jump ahead, this section will conclude that reliability-analogous calculations on cosine or other similarity matrices are important concepts in the context of learned representations. We will show worked examples of calculating two ‘pseudo’ measures of ‘reliability’ from embeddings, alpha and omega, and discuss how their interpretations are affected by the absence of a person sample. To begin, let us compare two situations: one in which learned representations are studied as a closed representational system, and one in which embedding-based pseudo-reliability calculations are used because of what they might tell us about the reliability of human responses to a measurement instrument.
Absence of person sample
When no participants are being scored, estimating score precision via the standard error of measurement disappears as a primary motivation for reliability (there is no scale standard deviation with which to calculate a standard error of measurement). Correction for attenuation is also impossible. The reason is not that there are "no observed sum scores" but that there are no respondent-level observed variables whose relationships are being estimated. In any case, latent variable estimates are an output of the factor model, as they are in conventional psychometrics. An open question is whether pseudo-reliability quantities provide an upper bound on the validity coefficients that can be observed when using learned representations. Given that, in practice, reliability coefficients are often substantially higher than validity coefficients, one might question the utility of calculating reliability in the context of learned representations altogether.
Reasons to cautiously proceed
An important reason to proceed, however, is that pseudo reliability might tell us about the reliability of the measure when using human responses. Embeddings might be used to predict human response-based estimates of reliability, in addition to supporting analogous calculations to reliability. Even in the case where reliability is calculated via an analogous operation on a matrix of embedding similarities, however, the resulting quantity is computed from a fixed similarity matrix rather than from a sample of respondents. While there is a fixed set of items, there is no population of respondents from which response data are sampled when analysing encoder embeddings or decoder internal representations. It is possible to study a model’s generated responses.
However, a decoder's generated behaviours are arguably a more indirect reflection of the model's internal representations than encoder embeddings. If the relationship between embedding-based and response-based reliability is sufficiently strong, pseudo-reliability may yield advance knowledge of the human response reliability of a scale, including its eventual standard error of measurement and the upper bound on observed correlations. One proposed way to obtain reliability from embeddings is to build prediction models. This appeals at first encounter because prediction keeps clear water between the AI embeddings and the reliability coefficient itself. Reliability has strong foundations and established practices that carry the advantage of wide acceptance, so saying that we are predicting reliability, rather than estimating or calculating the quantity itself, is less imposing.
Similarities for correlations
However, following the proposal to substitute the respondent correlation matrix with an item cosine similarity matrix (Guenole et al., 2024), we can calculate quantities that are mathematically analogous to familiar reliability coefficients, such as Cronbach’s alpha. By fitting the pseudo factor model to embedding matrices and producing factor structures before respondent data are collected, we can similarly calculate McDonald’s omega. These values may be informative about or predictive of their human response counterparts. We now explore each in turn, providing data and code.
We show how they correspond to actual reliabilities and discuss how interpretation differs between embeddings and human responses. Importantly, the cosine similarity matrix shares key algebraic properties with a correlation matrix: it is symmetric, positive semidefinite, and bounded in [-1, 1]. Mechanically at least, this permits the application of the factor model and the reliability formulas explored here. However, unlike correlation coefficients estimated from response data, cosine similarities are not derived from respondent sampling, so conventional interpretations and benchmarks may not apply.
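The algebraic properties listed above can be checked numerically. The sketch below builds a cosine similarity matrix from a matrix of item embeddings and tests each property. The random embeddings, their dimensions, and the function name are placeholders assumed for illustration, since no particular encoder is specified here.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Cosine similarities between the rows of an (items x dimensions) array."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms                      # scale each item vector to unit length
    return unit @ unit.T                  # Gram matrix of unit vectors

# Placeholder data: 8 hypothetical items with 32-dimensional embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(8, 32))
S = cosine_similarity_matrix(E)

is_symmetric = bool(np.allclose(S, S.T))
has_unit_diagonal = bool(np.allclose(np.diag(S), 1.0))
is_bounded = bool(np.all(np.abs(S) <= 1.0 + 1e-12))
# eigvalsh is for symmetric matrices; small negative values are rounding error.
is_psd = bool(np.linalg.eigvalsh(S).min() >= -1e-10)

print(is_symmetric, has_unit_diagonal, is_bounded, is_psd)
```

Positive semidefiniteness follows because the matrix is a Gram matrix of unit-length vectors. The tolerance on the smallest eigenvalue is an assumption made to absorb floating-point error.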
Pseudo alpha
The formula for alpha translates directly when self-similarities (the diagonal) play the role of item variances and self-other similarities (the off-diagonals) play the role of covariances. It can be applied straightforwardly to the cosine similarity matrix.
α = (k / (k−1)) × (1 − Σδᵢ / Σsᵢⱼ)
where Σδᵢ is the sum of the diagonal similarities and Σsᵢⱼ is the sum of all elements in the cosine similarity matrix. Because every item has a cosine similarity of one with itself, Σδᵢ equals k. The first part of the formula is a correction, or scaling factor, that adjusts for the number of items. The second part is one minus the ratio of the diagonal similarities to the total similarity in the matrix. The correction has its largest effect when the number of items is small; as the number of items grows, the correction converges to one and the overall expression converges to the second component. Higher values indicate greater similarity among items in the embedding space. This calculation may one day serve as a quick zero-cost screen during scale development or item pool refinement, before human responses are collected.
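As a minimal sketch of the formula above, the function below computes pseudo alpha from a square similarity matrix. The function name and the toy matrix are assumptions for illustration; the analysis code on GitHub may be organised differently.

```python
import numpy as np

def pseudo_alpha(S):
    """alpha = (k / (k - 1)) * (1 - trace(S) / sum of all elements of S)."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    if S.ndim != 2 or S.shape != (k, k) or k < 2:
        raise ValueError("S must be a square matrix with at least two items")
    return (k / (k - 1)) * (1.0 - np.trace(S) / S.sum())

# Toy example: 4 hypothetical items, every pairwise cosine similarity 0.5.
k = 4
S_toy = np.full((k, k), 0.5)
np.fill_diagonal(S_toy, 1.0)

alpha_toy = pseudo_alpha(S_toy)
print(round(alpha_toy, 3))
```

For this toy matrix the trace is 4 and the total is 10, so the result is (4/3) × (1 − 0.4) = 0.8. That matches the standardised alpha formula for four items with an average inter-item value of 0.5.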
Pseudo and real alpha for each of the six scales of the G50 are presented alongside item-level statistics, in a format that will be familiar to psychologists working on scale development. Pseudo alpha values were somewhat lower than their response-based counterparts across all scales. Unlike response-based alpha, there is no direct analogue of the corrected item-total correlation for pseudo alpha because there are no person scores from which to compute a scale composite. We propose the mean of an item's cosine similarity with all other items in its scale as the analogue, and present this alongside the corrected item-total correlation from the response data.
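The proposed item-level analogue, each item's mean cosine similarity with the other items in its scale, can be sketched as follows. The function name and the three-item matrix are hypothetical.

```python
import numpy as np

def mean_similarity_with_others(S):
    """Mean off-diagonal similarity for each item (row) of a similarity matrix."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    # Row sums minus the self-similarity, averaged over the k - 1 other items.
    return (S.sum(axis=1) - np.diag(S)) / (k - 1)

# Hypothetical 3-item scale.
S_scale = np.array([
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

item_means = mean_similarity_with_others(S_scale)
print(np.round(item_means, 3))
```

Here the first item averages (0.6 + 0.2) / 2 = 0.4, the second 0.5, and the third 0.3, so the third item is the least similar to the rest of the scale.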
We also present alpha-if-item-deleted for both pseudo and real alpha in the code and results on GitHub, computed by removing each item in turn from the relevant sub-matrix or response matrix respectively. Because the number of data points available for any rank-order comparison equals the number of scales within a single inventory, conclusions about the correspondence between pseudo and real alpha must be treated with caution. With that caveat, Figure 1 plots pseudo against real alpha across the six scales, and we note for completeness that the Pearson and Spearman correlations were r = .81 and ρ = .89 respectively.
Scale | N_Items | Pseudo_Alpha | Real_Alpha
AN | 8 | .72 | .78 |
CO | 9 | .76 | .85 |
DE | 8 | .78 | .85 |
DI | 8 | .75 | .81 |
NE | 9 | .79 | .89 |
SC | 8 | .82 | .86 |
These associations are error prone and underpowered because there are only as many cases as there are scales. This makes it difficult to determine the relationship between the deterministic reliability quantities from embeddings and the estimated reliabilities from human responses. One option, taken here, is to present within-inventory monotonicity graphs and Spearman rank correlations while recognising that N = 6 is too low for strong conclusions. We might also pool scales across inventories to check against response-based alpha estimates with a larger N, but this is of little use to the applied psychometrics professional wanting reliability insights about their own scale.
Pseudo omega
Model-based reliability is generally preferred in response-based psychometrics because it relaxes the often unrealistic assumption of tau equivalence, i.e., that items have equal relationships to the underlying construct. Alpha mis-estimates reliability in settings where tau equivalence is violated. Model-based reliability, omega, can be estimated once a factor model's parameters are estimated. Analogously, pseudo omega can be computed on the cosine similarity matrix from the estimated factor loadings and uniquenesses.
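To make the arithmetic concrete, the sketch below computes omega for a single factor as the squared sum of loadings divided by that same quantity plus the summed uniquenesses. The loadings are hypothetical, and taking each uniqueness as one minus the squared loading assumes a standardised one-factor solution. Fitting the factor model itself is not shown.

```python
import numpy as np

def omega_total(loadings, uniquenesses):
    """omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(uniquenesses, dtype=float)
    common = lam.sum() ** 2
    return common / (common + theta.sum())

# Hypothetical standardised loadings for a 4-item scale.
lam = np.array([0.7, 0.6, 0.8, 0.5])
theta = 1.0 - lam ** 2          # assumption: standardised single-factor solution

omega = omega_total(lam, theta)
print(round(omega, 3))
```

With these values the squared sum of loadings is 6.76 and the uniquenesses sum to 2.26, giving roughly 0.749. Unlike alpha, the result depends on the individual loadings rather than only on the average similarity.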
ω = (Σλᵢ)² / [(Σλᵢ)² + Σδᵢ]
where λᵢ are factor loadings and δᵢ are uniquenesses.
Pseudo omega was estimated by fitting a six-factor model to the full 50-item cosine similarity matrix using maximum likelihood extraction in lavaan, followed by oblique Procrustes rotation toward the hypothesised scale structure as the target. Factor-to-scale correspondence was established using the Dominant Average Absolute Loading (DAAL) criterion with the Hungarian algorithm to enforce unique one-to-one assignment. The same procedure was applied to the human response data for comparability.
For the human response model, fit was acceptable: χ²(940) = 1954.84, p < .001, CFI = .90, TLI = .86, RMSEA = .044, SRMR = .032. For the pseudo model, chi-square-based fit indices are not interpretable because n.obs is set to an arbitrary value; the SRMR was .046.
Pseudo omega values ranged from .46 (Antagonism) to .83 (Schizotypy) and were lower than their response-based counterparts, which ranged from .79 to .90, on every scale. The gap was smallest for Schizotypy, where pseudo and real omega were nearly identical (.83 and .85 respectively). It was largest for Antagonism, where items span semantically heterogeneous behavioural domains (flattery, deception, dominance, aggression) that do not cluster tightly in embedding space despite reflecting a common latent trait in human response data. Although the number of scales is too small to place interpretive weight on any rank-order association, we note for completeness that the Pearson and Spearman correlations between pseudo and real omega across the six scales were r = .71 and ρ = .54 respectively.
Scale | N_Items | Pseudo_Omega | Real_Omega
AN | 8 | .46 | .79 |
CO | 9 | .74 | .87 |
DE | 8 | .59 | .86 |
DI | 8 | .64 | .81 |
NE | 9 | .78 | .90 |
SC | 8 | .83 | .85 |
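As a rough check on the rank-order association, the sketch below recomputes the Pearson and Spearman correlations from the two-decimal values in the table above. Because the table is rounded, the Pearson value may differ slightly from the one reported in the text. The ranking helper assumes no tied values, which holds for these twelve numbers.

```python
import numpy as np

# Values copied from the table above, in the order AN, CO, DE, DI, NE, SC.
pseudo_omega = np.array([0.46, 0.74, 0.59, 0.64, 0.78, 0.83])
real_omega = np.array([0.79, 0.87, 0.86, 0.81, 0.90, 0.85])

def ranks_no_ties(x):
    """Ranks from 1 to n; assumes all values are distinct."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

pearson_r = np.corrcoef(pseudo_omega, real_omega)[0, 1]
# Spearman's rho is the Pearson correlation of the ranks.
spearman_rho = np.corrcoef(ranks_no_ties(pseudo_omega),
                           ranks_no_ties(real_omega))[0, 1]

print(round(pearson_r, 2), round(spearman_rho, 2))
```

With only six scales, moving a single scale by one rank position changes the Spearman coefficient noticeably, which is one way to see why little interpretive weight can be placed on it.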
Pseudo omega hierarchical
Omega hierarchical can be calculated if a bifactor model is fitted and only the squared sum of loadings on the general factor is included in the numerator. Although the quantity is based on deterministic embeddings, it is derived from estimated factor model parameters, and the conventional interpretation of the resulting variance components no longer applies. It might instead be interpreted as the proportion of variance in the embedding similarity structure attributable to the dominant factor.
ω_h = (Σλg)² / [(Σλg)² + Σₛ(Σλs)² + Σδᵢ]
where λg are general factor loadings, λs are the loadings on specific factor s, and δᵢ are uniquenesses.
Pseudo ω_h was estimated by fitting a bifactor model to the full 50-item cosine similarity matrix using maximum likelihood extraction in lavaan with seven factors (one general and six specific) and orthogonal bifactor rotation. The same procedure was applied to the human response data. For the human response model, fit was good: χ²(896) = 1677.80, p < .001, CFI = .92, TLI = .89, RMSEA = .040, SRMR = .029. For the pseudo model, chi-square-based fit indices are not interpretable; the SRMR was .042.
Omega hierarchical was computed for the full 50-item instrument, with the general factor loadings entering the numerator and total composite variance (general, specific, and unique) forming the denominator. Pseudo ω_h was .94 and human ω_h was .88. Unlike alpha and omega, where pseudo values were consistently lower than human values, pseudo ω_h exceeded its human counterpart. This suggests that a single dominant semantic dimension accounts for a particularly large proportion of composite variance in the embedding space, larger than the proportion the general factor accounts for in human response data.
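The composite-variance arithmetic can be sketched as follows, assuming orthogonal factors so that the specific contribution is the squared column sum of loadings for each specific factor. The loadings are hypothetical and the uniquenesses assume a standardised solution. This illustrates only the final calculation, not the bifactor estimation or rotation.

```python
import numpy as np

def omega_hierarchical(general, specific, uniquenesses):
    """General-factor share of composite variance under orthogonal factors."""
    g = np.asarray(general, dtype=float)
    s = np.asarray(specific, dtype=float)      # items x specific factors
    u = np.asarray(uniquenesses, dtype=float)
    general_var = g.sum() ** 2
    specific_var = (s.sum(axis=0) ** 2).sum()  # squared column sum per factor
    return general_var / (general_var + specific_var + u.sum())

# Hypothetical loadings: 4 items, 1 general factor, 2 specific factors.
lam_g = np.array([0.6, 0.5, 0.7, 0.6])
lam_s = np.array([
    [0.4, 0.0],
    [0.3, 0.0],
    [0.0, 0.3],
    [0.0, 0.4],
])
# Assumption: standardised items, so uniqueness = 1 - communality.
uniq = 1.0 - lam_g ** 2 - (lam_s ** 2).sum(axis=1)

omega_h = omega_hierarchical(lam_g, lam_s, uniq)
print(round(omega_h, 3))
```

In this toy case the general term is 5.76, the specific terms sum to 0.98, and the uniquenesses sum to 2.04, so the general factor accounts for about 0.656 of composite variance.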
Further reflections on Omega
What is not immediately apparent from the pseudo factor breakdown is that, in conventional single-time-point factor analysis, uniqueness is typically interpreted as reflecting both item-specific variance and random error. In human response-based factor analysis these components of uniqueness can be disentangled by adding a further measurement point. Longitudinal factor analysis allows correlated residuals between the same item at different time points, representing variance that remains after controlling for the factor and that is not random error. If we were to run such a model with perfectly reproducible sentence encoder-based cosine similarities, the longitudinal model would be degenerate: the repeated measurements would be identical, leaving no stochastic residual error to distinguish from item-specific variance. Perfect reproducibility is a reasonable assumption given the same model version and hardware.
This means that, assuming perfect reproducibility and adequate model fit, uniqueness in the single-group case may be interpreted primarily as item-specific variance rather than random measurement error. Otherwise, uniqueness will also reflect variance arising from imperfect reproducibility and model misfit. For generative AI models, this assumes a fixed seed, temperature, and prompt. If instead the model's generated responses are analysed, conventional alpha and omega can be estimated from the resulting response correlation matrix.
Item level diagnostics
These results allow only cautious interpretation because the sample size for any comparison between pseudo and human response indices is equal to the number of scales, which is six in the present case. Pooling scales across multiple inventories could increase this N, but that is of limited practical value to the applied scale developer, who works with their own instrument and cannot draw on data from unrelated measures to evaluate it. All item-level diagnostics (alpha-if-item-deleted, omega loadings, item-total correlations, and their pseudo analogues) are presented in the tables and code on GitHub but are not discussed in detail here. They do not always show consistent item-level correspondence between pseudo and human response indices, and their interpretation under the embedding framework is at this point unclear.
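For orientation, one of these diagnostics, pseudo alpha-if-item-deleted, can be sketched by dropping each item's row and column from the scale's similarity sub-matrix and recomputing alpha. The function names and the toy matrix are assumptions and are not taken from the GitHub code.

```python
import numpy as np

def pseudo_alpha(S):
    """alpha = (k / (k - 1)) * (1 - trace(S) / sum of all elements of S)."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    return (k / (k - 1)) * (1.0 - np.trace(S) / S.sum())

def alpha_if_item_deleted(S):
    """Pseudo alpha recomputed with each item removed in turn (needs k >= 3)."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    out = np.empty(k)
    for i in range(k):
        keep = np.delete(np.arange(k), i)          # indices of remaining items
        out[i] = pseudo_alpha(S[np.ix_(keep, keep)])
    return out

# Toy scale: items 1-3 are mutually similar (0.5); item 4 is an outlier (0.1).
S_toy = np.full((4, 4), 0.5)
S_toy[3, :] = 0.1
S_toy[:, 3] = 0.1
np.fill_diagonal(S_toy, 1.0)

full_alpha = pseudo_alpha(S_toy)
deleted = alpha_if_item_deleted(S_toy)
print(round(full_alpha, 3), np.round(deleted, 3))
```

In this toy case removing the fourth item raises pseudo alpha from about 0.63 to 0.75, while removing any of the first three lowers it. This is the pattern a scale developer would look for when screening items.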
Conclusion
Operations analogous to those performed on a correlation matrix to obtain reliability can be performed on a cosine similarity matrix among embeddings. These operations produce different quantities despite the calculations being identical: the mathematics is the same but the inputs differ, so the interpretation differs. Although these quantities are not reliability coefficients in the classical psychometric sense, they may prove useful in two respects: as indices of the internal coherence of representational AI systems, and as potential predictors of the reliability of corresponding human response-based measures.
References
McDonald, R. P. (2013). Test theory: A unified treatment. Psychology Press.
Raykov, T., & Marcoulides, G. A. (2016). Scale reliability evaluation under multiple assumption violations. Structural Equation Modeling: A Multidisciplinary Journal, 23(2), 302-313.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).