Scoring (3 of 3): Neural contrastive pairwise regression (NCPR)

  • Introduction
  • Data preparation
  • Essay pairing
  • Target and labels
  • Specifying the architecture
  • Model training
  • Model inference
  • References

Introduction

In this section we discuss an end-to-end neural technique for essay scoring called Neural Contrastive Pairwise Regression (NCPR). We discuss the implementation by Xie et al. (2022), who used the method to predict essay scores in the Automated Student Assessment Prize (ASAP) automated essay scoring (AES) challenge.

The challenge was established by the Hewlett Foundation and hosted on Kaggle in 2012; today it is a classic benchmark dataset. We have seen this competition and dataset before in the scoring method chapters: in part 3, where we used LLM zero- and few-shot scoring, and in part 4, where we used frozen embeddings in classical psychometric and machine learning models.

Data preparation

The first stage in NCPR is to prepare the dataset. As in our earlier examples with these data, separate models are built for different essay prompts. Here we consider just one essay prompt to illustrate the method, again essay 10, as we did for LLM scoring.

With complex models, a helpful start is to understand the data structure: what the model takes as input and what it predicts as output. The input for this analysis consists of the essay one response, the essay two response, and the variable to be predicted, here essay score 1 - essay score 2. Next, we create the essay pairs and the difference scores.
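As a sketch, each training record can be represented as a pair of essay texts plus the signed score difference. The field names here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class EssayPairRecord:
    essay_1_text: str
    essay_2_text: str
    score_diff: int  # essay 1 human score minus essay 2 human score

# Toy example: essay 1 was scored 4, essay 2 was scored 2, so the target is +2.
record = EssayPairRecord("First essay response...", "Second essay response...", 4 - 2)
```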

Essay pairing

The essay pairing is not entirely clear from the paper, but going by the text it appears to be constructed in the following way. Essays are processed in order and pairs are formed between each essay and the next. Additionally, if pair a and b and pair b and c are included, pair a and c must be too.

Consider the first three essays: essay 1 is paired with 2, and essay 2 is paired with 3, so essay 1 is also paired with 3. This is needed so the model learns transitive relations, i.e., if essay 1 is better than 2 and 2 is better than 3, the model needs to learn that 1 must be better than 3. Without the additional pair, nothing enforces this consistency. Some pairings are omitted: under this scheme, essay 1 is never paired with essay 4, for instance.
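Under that reading, the pairing can be sketched as adjacent pairs plus one transitive step. This is our reconstruction of the scheme, not code from the paper:

```python
def build_pairs(n_essays):
    """Pair each essay with the next one, then add the one-step
    transitive pairs (i, i + 2) implied by consecutive pairs."""
    adjacent = [(i, i + 1) for i in range(1, n_essays)]
    transitive = [(i, i + 2) for i in range(1, n_essays - 1)]
    return adjacent + transitive

pairs = build_pairs(4)
# Essays 1..4 yield (1,2), (2,3), (3,4) plus (1,3), (2,4); (1,4) is never formed.
```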

Target and labels

The outcome predicted in this use case is the difference between the human scores for the paired essays (e.g., essay 1 score - essay 2 score). In the code, it appears that transitivity may be learned indirectly, by randomly sampling essays from groups with the same score, rather than systematically in the way described.

Specifying the architecture

The first step in the architecture is to specify the same BERT encoder to create embeddings of each of the essays. Next, a Siamese network, two identical multilayer perceptron (MLP) neural networks (nn1, nn2), is specified that transforms the embeddings into a new representation for the next step.

Using identical networks means that it doesn't matter which essay comes first in the pair: you get the same score difference with just the sign flipped. Next comes the calculation of a difference vector between each pair of transformed embeddings. Finally, a third MLP is added that predicts the outcome, i.e., the difference score representing the contrast between the human-assigned scores for the two essays.
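A minimal numpy sketch of the Siamese stage, with random vectors standing in for the BERT embeddings. To make the sign-flip property exact, the prediction head here is a bias-free linear map; that is our simplification for illustration, not necessarily the paper's choice:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 16                    # embedding size, hidden size
W1 = rng.normal(size=(D, H))    # shared MLP weights (used for both essays)
b1 = np.zeros(H)
w_out = rng.normal(size=H)      # bias-free linear prediction head

def shared_mlp(e):
    """Identical transform applied to each essay embedding (Siamese)."""
    return np.maximum(e @ W1 + b1, 0.0)  # one ReLU layer

def predict_diff(e1, e2):
    """Transform both embeddings, take the difference vector, map to a scalar."""
    d = shared_mlp(e1) - shared_mlp(e2)
    return d @ w_out

e1, e2 = rng.normal(size=D), rng.normal(size=D)
# Swapping the essays in the pair flips only the sign of the prediction.
```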

[Figure: NCPR model architecture]

Model training

Essay pairs are then fed through the encoder and multilayer perceptron model to predict the difference score in a forward pass. Next, the loss for this pair is calculated using mean squared error (MSE). This loss is backpropagated through the architecture, calculating gradients of the loss with respect to all model parameters.

In optimisation, all weights are adjusted using gradient descent (or a variant, e.g., AdamW). The adjustment size is determined by the learning rate, which can differ for each model component: smaller for the encoder, which we're fine-tuning, and higher for the MLPs, which are learning from scratch. We train with repeated cycles of forward passes, backpropagation, and optimisation until convergence (i.e., until the loss bottoms out).
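The cycle of forward pass, backpropagation, and update can be sketched with plain gradient descent on a linear toy model, using the analytic MSE gradient. A single learning rate is used here; in the full model the encoder would get a smaller rate than the MLPs. The data and rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
true_w = rng.normal(size=D)

# Toy data: embedding difference vectors and the score differences to predict.
X = rng.normal(size=(200, D))      # one row per essay pair (e1 - e2)
y = X @ true_w                     # target score differences

w = np.zeros(D)                    # model weights, learned from scratch
lr = 0.01                          # learning rate
losses = []
for _ in range(100):
    pred = X @ w                   # forward pass
    err = pred - y
    losses.append(np.mean(err ** 2))   # MSE loss for this pass
    grad = 2 * X.T @ err / len(y)  # backprop: gradient of MSE w.r.t. w
    w -= lr * grad                 # gradient-descent update
# The loss falls over repeated cycles until it bottoms out.
```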

Model inference 

Now the model is trained, we can move on to predicting scores for essays that have not been human scored, or for which we know the human scores but hold them back for validation. To do so, we create a new dataset of essays to feed through the trained model in a single forward pass. When actually scoring new essays, this step is called inference rather than training.

The key question here is which essay the new essay being scored should be paired with. The answer is that we can pair it with a randomly selected essay from the training data; this paired reference essay must be a response to the same prompt as the essay being scored. In fact, the authors report doing this multiple times, selecting K random essays and using vote averaging to determine the predicted score.

References

Xie, J., Cai, K., Kong, L., Zhou, J., & Qu, W. (2022). Automated essay scoring via pairwise contrastive regression. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 2724-2733).


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).