Scoring (2 of 3): frozen embeddings, downstream predictions

In this section we show an example of frozen hybrid scoring from the taxonomy discussed in the crash course in transformer scoring for unstructured data. We use a sentence encoder to get embeddings of candidate constructed responses and build models to predict human scores of those responses. We again use the ASAP-SAS data, specifically the Question 10 subset. Before we begin, several points are worth discussing. You can download the notebook containing these analyses.

These models are frozen hybrids in the psychometrics.ai crash course taxonomy because, unlike the zero-shot and few-shot models, they fuse the transformer with a prediction head. Unlike neural end-to-end examples such as neural contrastive pairwise regression, no encoder weights are updated via backpropagation. The encoder is used only as a static feature extractor. Supervised learning occurs only in the model head fitted on top of the frozen embeddings.

Frozen transformer embeddings are one way to score constructed responses. The encoder provides fixed text representations, and a separate head models the relationship between those representations and human scores. There is a history of methods using hand-crafted features, and these are not always outperformed by embedding models. Embeddings can also be used in stacked models, where they augment the hand-crafted features. Hand-crafted feature models are sometimes preferred because of their interpretability.

This section is not intended as a search for the best possible scoring model for ASAP-SAS Q10. Instead, it is a worked example of a frozen-embedding scoring pipeline: a pre-trained sentence encoder is used as a fixed feature extractor, and supervised model heads are trained on top of those fixed representations. The aim is to show what this relatively cheap and reproducible baseline can achieve before moving to rubric engineering, feature engineering, retrieval, ensembles, or fine-tuning.

Seven supervised scoring heads

We will compare seven model heads that practitioners often try with frozen embeddings: ordinal logistic regression, to capitalize on the ranked structure of the scores; nominal logistic regression, which is order-agnostic; linear and non-linear support vector machines, which maximize the margin (gap) between classes; a multilayer perceptron, a simple neural scoring head that models nonlinear relationships; and random forests and gradient boosting, which are tree-based ensemble methods. We will try all seven with and without principal component reduction.
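The heads listed above could be set up with scikit-learn roughly as follows. This is a minimal sketch, not the configuration used in the analysis: the hyperparameter values are placeholders, the helper names `make_heads` and `with_pca` are hypothetical, and the ordinal logistic regression head is left out because scikit-learn does not provide one (a third-party package would be needed).

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC


def make_heads(seed=0):
    """Return six of the seven heads; all hyperparameters are illustrative placeholders."""
    return {
        "nominal_lr": LogisticRegression(max_iter=2000),
        "linear_svm": LinearSVC(random_state=seed),
        "rbf_svm": SVC(kernel="rbf"),
        "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
    }


def with_pca(head, n_components=50, seed=0):
    """Wrap a head in a pipeline so PCA is fit on the training data only."""
    return make_pipeline(PCA(n_components=n_components, random_state=seed), head)
```

Putting PCA inside the pipeline means the components are re-estimated on each training split, so the held-out data never influences the projection.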

Embeddings are often linearly probeable, meaning they reveal linear relationships between their combined dimensions and a criterion. This is a good reason to expect the linear models to perform well. In contrast, we might expect tree-based methods such as random forests to perform less well, since they split on individual dimensions, and approximating a truly linear, distributed signal requires many such splits. Gradient boosting is also tree-based, but by correcting residual errors it can be more competitive than random forests.
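A small synthetic illustration of this argument follows. It does not use the ASAP-SAS embeddings: the data are random Gaussian features, and the three score bands are cut from a weighted sum of all dimensions, which is an assumed stand-in for a distributed linear signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 900 "responses" in 50 dimensions.
X = rng.normal(size=(900, 50))

# The criterion depends on a weighted sum of ALL dimensions (distributed linear signal).
w = rng.normal(size=50)
latent = X @ w

# Cut the latent score into three ordered bands, mimicking scores 0, 1 and 2.
y = np.digitize(latent, np.quantile(latent, [1 / 3, 2 / 3]))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

linear_acc = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest_acc = forest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"logistic regression accuracy: {linear_acc:.3f}")
print(f"random forest accuracy:       {forest_acc:.3f}")
```

In this constructed case the linear head should recover the signal far more easily than the forest, because each tree split looks at only one of the 50 dimensions at a time.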

Pre-processing

It is not clear that applying PCA before the supervised scoring heads will improve performance. PCA is unsupervised and so risks discarding criterion information. Sentence embedding spaces are often anisotropic, and score-relevant signal may be distributed across many dimensions, including lower-variance dimensions that PCA could lose unless dominant components are discarded.
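The risk can be made concrete with a deliberately extreme synthetic case. This is a sketch under assumed conditions, not a description of the Q10 embeddings: five high-variance dimensions carry no label information, and the label depends on one low-variance dimension, so a five-component PCA keeps the noise and drops the signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 800

# Five high-variance dimensions that carry no information about the label.
noise_dims = rng.normal(scale=10.0, size=(n, 5))
# Fifteen low-variance dimensions; the label depends only on the first of them.
quiet_dims = rng.normal(scale=1.0, size=(n, 15))
X = np.hstack([noise_dims, quiet_dims])
y = (quiet_dims[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Head fitted on the raw features.
raw_acc = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)

# Same head after keeping only the five highest-variance principal components.
pca_head = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=2000))
pca_acc = pca_head.fit(X_tr, y_tr).score(X_te, y_te)

print(f"raw features: {raw_acc:.3f}")
print(f"PCA(5):       {pca_acc:.3f}")
```

Real embeddings are unlikely to be this adversarial, but the example shows why variance retained is not the same thing as criterion information retained.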

Norming removes magnitude differences but does not alter directions, so it cannot correct anisotropy, and many embeddings are already normed in any case. Whitening goes further than PCA or norming by decorrelating the dimensions and rescaling them to unit variance. We do not use whitening here because early tests showed that standardization, a milder variance-equalizing step, hurt performance.
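The difference between the two operations can be checked numerically. The sketch below uses synthetic correlated features as an assumed stand-in for embeddings, and uses scikit-learn's `PCA(whiten=True)` as one common way to whiten; other whitening transforms exist.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Synthetic correlated "embeddings": 500 rows, 8 dimensions.
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(500, 8)) @ mixing

# L2 norming: every row gets unit length, but its direction is unchanged.
X_normed = normalize(X, norm="l2")
row_norms = np.linalg.norm(X_normed, axis=1)
cosine_with_original = np.sum(X_normed * X, axis=1) / np.linalg.norm(X, axis=1)

# Whitening: rotate onto the principal axes and rescale each axis to unit variance.
X_white = PCA(whiten=True, svd_solver="full").fit_transform(X)
white_cov = np.cov(X_white, rowvar=False)

print(np.allclose(row_norms, 1.0))             # rows have unit length
print(np.allclose(cosine_with_original, 1.0))  # directions are preserved
print(np.allclose(white_cov, np.eye(8)))       # decorrelated, unit variance
```

After norming, the cosine between each row and its original is 1, so the geometry that makes the space anisotropic is untouched. After whitening, the covariance matrix is the identity.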

Method and data

We evaluated the models on Question 10 from the Automated Student Assessment Prize Short Answer Scoring dataset (ASAP-SAS), the same short-answer item used in the previous prompt-based scoring example. This contains 1,640 scored student responses, with scores ranging from 0 to 2. For each outer fold, the data were split into 80% outer-training data and 20% outer-test data. The outer-training portion was then split again, with 75% used to fit candidate models and 25% used to select the hyperparameter value. This gives an effective 60/20/20 train-validation-test structure within each fold. After hyperparameter selection, the selected model was refit on the full 80% outer-training data and evaluated once on the held-out 20% outer-test data.
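The split-tune-refit-evaluate logic for a single outer fold could look like the sketch below. The function name `tune_refit_score`, the choice of logistic regression as the head, and the grid of `C` values are assumptions for illustration; the source does not state which hyperparameters were tuned or over which values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split


def tune_refit_score(X_outer_train, y_outer_train, X_test, y_test,
                     c_grid=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Select C on an inner 75/25 split, refit on all outer-training data,
    then score once on the held-out test fold."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_outer_train, y_outer_train,
        test_size=0.25, stratify=y_outer_train, random_state=seed,
    )

    def qwk(model, X, y):
        return cohen_kappa_score(y, model.predict(X), weights="quadratic")

    # Fit one candidate per grid value; keep the value with the best validation QWK.
    val_scores = {}
    for c in c_grid:
        candidate = LogisticRegression(C=c, max_iter=2000).fit(X_fit, y_fit)
        val_scores[c] = qwk(candidate, X_val, y_val)
    best_c = max(val_scores, key=val_scores.get)

    # Refit with the selected value on the full outer-training portion.
    final_model = LogisticRegression(C=best_c, max_iter=2000)
    final_model.fit(X_outer_train, y_outer_train)
    return best_c, qwk(final_model, X_test, y_test)
```

The test fold is touched exactly once, after the hyperparameter has been fixed, so the reported QWK is not contaminated by the selection step.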

Sentence encoder

We used all-MiniLM-L6-v2, a pre-trained sentence-transformer model, to generate frozen 384-dimensional embeddings for each response. This is a well-known, lightweight, general-purpose encoder that encoded the dataset locally in seconds. Using it without fine-tuning keeps the encoder constant across models, so differences in performance are due to the choice of classifier. A larger encoder, all-mpnet-base-v2, performed no better.
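A minimal sketch of the encoding step, assuming the `sentence-transformers` package is installed and the model can be downloaded. The function name `embed_responses` and the optional `encoder` argument are hypothetical; the argument exists only so the function can be exercised with a stand-in object.

```python
import numpy as np


def embed_responses(texts, encoder=None, model_name="all-MiniLM-L6-v2"):
    """Encode texts with a frozen sentence encoder and return a 2-D NumPy array."""
    if encoder is None:
        # Imported lazily so the function can also be called with a stub encoder.
        from sentence_transformers import SentenceTransformer

        encoder = SentenceTransformer(model_name)
    # encode() runs inference only; no encoder weights are updated.
    embeddings = encoder.encode(list(texts), convert_to_numpy=True)
    return np.asarray(embeddings)
```

Called as `embed_responses(list_of_response_strings)`, this should return one row per response, with 384 columns for this encoder. The resulting array can then be cached and reused by every scoring head.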

Training, validation and testing sub-sampling

To ensure this train/validation/test procedure produced a stable estimate of performance rather than one dependent on a lucky or unlucky split, we repeated it using 5-fold cross-validation, so that every response served as test data exactly once per repetition. We then repeated the entire 5-fold procedure 10 times with different random partitions of the data, yielding 50 train-tune-test evaluations per model.
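The resampling scheme can be reproduced with scikit-learn's `RepeatedStratifiedKFold`. This is a sketch: the labels are random placeholders for the 1,640 scores, and stratifying the outer folds is an assumption, since the text does not say whether the outer folds were stratified.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Random placeholder labels standing in for the 1,640 scores in {0, 1, 2}.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=1640)
X = np.zeros((y.size, 1))  # features do not matter for generating the splits

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
splits = list(cv.split(X, y))

# Count how often each response lands in a test fold across all repetitions.
test_counts = np.zeros(y.size, dtype=int)
for _, test_idx in splits:
    test_counts[test_idx] += 1

print(len(splits))                           # number of train-tune-test evaluations
print(test_counts.min(), test_counts.max())  # times each response was tested
```

The split list has 50 entries, and every response appears in a test fold 10 times, once per repetition.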

QWK performance metric

Quadratic Weighted Kappa (QWK) is our outcome measure for the comparisons. It extends Cohen's kappa, a statistic that corrects for the level of agreement expected by chance. QWK adds a penalty proportional to the squared distance between the predicted and true score categories. A model that confuses a 0 for a 2 is penalized more heavily than one that confuses a 0 for a 1, whereas raw accuracy and unweighted kappa treat every misclassification the same.
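To make the penalty structure concrete, here is a hand-rolled QWK, offered as a sketch for checking understanding rather than as a replacement for a library routine. The function name `quadratic_weighted_kappa` is hypothetical, and the small score vectors are invented for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def quadratic_weighted_kappa(y_true, y_pred, n_categories):
    """QWK = 1 - sum(W * O) / sum(W * E), with W[i, j] = (i - j) ** 2."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Observed confusion matrix O (rows: true score, columns: predicted score).
    observed = np.zeros((n_categories, n_categories))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected matrix E under chance agreement: outer product of the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic disagreement weights: zero on the diagonal, growing with distance.
    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Invented example: both predictions make exactly one error out of ten.
truth = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
near_miss = [0, 1, 1, 1, 2, 2, 0, 1, 2, 2]  # one 0 scored as 1
far_miss = [0, 2, 1, 1, 2, 2, 0, 1, 2, 2]   # one 0 scored as 2

print(quadratic_weighted_kappa(truth, near_miss, 3))
print(quadratic_weighted_kappa(truth, far_miss, 3))
print(cohen_kappa_score(truth, far_miss, weights="quadratic"))
```

Both predictions have the same raw accuracy (9 of 10 correct), but the far miss receives a noticeably lower QWK than the near miss. The last line checks the hand-rolled value against scikit-learn's `cohen_kappa_score` with `weights="quadratic"`.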

Nadeau and Bengio (2003) corrected comparisons

To compare models accounting for the fact that our 50 evaluations were not independent (training sets overlap across the 50 re-samples) we used the corrected resampled t-test proposed by Nadeau and Bengio (2003) rather than a standard paired t-test. The Nadeau-Bengio correction addresses this by inflating the variance term used to compute the t-statistic with an additional factor reflecting the proportion of data held out for testing relative to training.
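The corrected statistic can be sketched as follows. With J paired differences in QWK, the usual paired t-test divides the variance by J; the correction replaces 1/J with 1/J + n_test/n_train. The function name `corrected_resampled_ttest` is hypothetical, and taking n_test/n_train as 0.20/0.80 = 0.25 (the outer split) is an assumption about how the ratio was set in this analysis.

```python
import numpy as np
from scipy import stats


def corrected_resampled_ttest(scores_a, scores_b, test_train_ratio):
    """Corrected resampled t-test on paired per-evaluation scores.

    test_train_ratio is n_test / n_train for a single evaluation.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n_evals = diffs.size
    mean_diff = diffs.mean()
    var_diff = diffs.var(ddof=1)

    # A standard paired t-test would use var / J; the correction adds n_test / n_train.
    corrected_var = (1.0 / n_evals + test_train_ratio) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)

    # Two-sided p-value from a t distribution with J - 1 degrees of freedom.
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n_evals - 1)
    return t_stat, p_value
```

For 50 evaluations and a ratio of 0.25, the variance multiplier grows from 0.02 to 0.27, so the corrected t-statistic is roughly a quarter the size of the uncorrected one. Differences that look decisive under a naive paired test can become non-significant.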

Results

The 2D visualization of the embeddings is a t-SNE plot that shows the 1,640 Q10 responses projected into two dimensions, colored by score. The motivation is to check whether the embeddings carry signal related to score before trusting a classifier built on them. While distinct clusters form, colors mix substantially within clusters. We see enough signal for a classifier to do better than chance but perhaps not enough for highly accurate scoring.
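A projection of this kind can be computed with scikit-learn's `TSNE`. The sketch below uses synthetic 384-dimensional data with three overlapping groups as an assumed stand-in for the Q10 embeddings, and the perplexity and initialization settings are placeholders rather than the values used for the plot described above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: three overlapping score groups in a 384-dimensional space.
scores = np.repeat([0, 1, 2], 60)
centers = rng.normal(scale=0.5, size=(3, 384))
embeddings = centers[scores] + rng.normal(size=(scores.size, 384))

# Project to two dimensions; perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)
```

The two columns of `coords` can then be passed to a scatter plot, with `scores` supplying the colors. Because t-SNE distorts global distances, the plot is best read as a rough check for score-related structure, not as evidence of how separable the classes are.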

Across the seven classifiers and two preprocessing conditions (14 combinations in total), each tuned within the outer training fold using a stratified validation split and evaluated across 50 repeated train-validation-test fold-evaluations, Nominal Logistic Regression on raw embeddings produced the highest mean Quadratic Weighted Kappa (QWK = 0.711).

In general, simplicity was rewarded: linear-additive models (both logistic regression variants and the linear-kernel SVM) matched or outperformed models capable of representing nonlinearity, including the RBF-kernel SVM, Random Forest, Gradient Boosting, and the Multilayer Perceptron, which likely needed more data and better tuning.

The top-performing model is Nominal LR on raw embeddings, with several linear models (Linear SVM, SVM, Ordinal LR) showing competitive results. Dimensionality reduction via PCA(50) provides little to no benefit for most classifiers and hurts MLP performance. Random Forest models are the only ones significantly worse than the best configuration according to the Nadeau-Bengio corrected pairwise p-values.

This heat map presents mean QWK scores by model and preprocessing strategy. Nominal LR achieves the highest performance on both raw and PCA-reduced embeddings. PCA(50) yields minimal benefit for most models and substantially degrades MLP performance. Colour intensity reflects QWK value.

The final ranking confirms that Nominal LR on raw embeddings achieves the highest performance, with a mean QWK of 0.7114. Several linear models, including Linear SVM and SVM on raw features, follow closely with negligible differences. Most configurations are statistically comparable to the best model, but the Random Forest variants are significantly worse. PCA(50) preprocessing offers little to no improvement and sometimes worsens performance.

Interpretations

The result falls below human-human agreement (QWK = 0.884), so the model agrees with human scores less closely than two trained human raters agree with each other. Zero-shot and few-shot prompting of GPT-4 on a subsample of this same essay set achieved QWK values of 0.733 and 0.750 respectively in an earlier section. Higher performance has also been achieved using non-transformer feature engineering, such as semantic features, rubric matching, and stacking models.

There is little separating the best-performing methods, suggesting that once you have decent embeddings, the choice of classifier has less effect. The tree-based methods did not perform as well as the linear models, indicating that they struggle to approximate what the other models show to be linear relationships. When comparing against QWK values reported in other papers, check whether the result rests on a single, possibly lucky, train-validation-test split. Here we average over 50 train-validation-test evaluations.

Taken together, these comparisons show that frozen embeddings with simple heads capture a reliable association between text and score at minimal computational cost and with no task-specific engineering. While approaches that add feature engineering, fine-tuning, or LLM prompting achieve higher performance, the frozen-embedding results are a cheap baseline for what is achievable.

References

Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239–281.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).