- Background to automated scoring
- Generative AI with zero-shot or few-shot prompting
- Embeddings as input to other models
- End-to-end neural scoring models
- Category stacking and ensembles
- More considerations
- References
Background to automated scoring
Prior to transformers, automated text scoring relied on hand-crafted features (word counts, n-grams, readability metrics) combined with classical machine learning models such as support vector machines and random forests. Early neural approaches based on recurrent and convolutional networks began to automate feature learning but were eventually superseded by transformer architectures.
With transformer-based models, we can now approach automated text scoring in three main ways: using large language models (LLMs) with minimal training data (zero-shot and few-shot prompting), extracting embeddings from a lightweight encoder as features for traditional modeling, and training end-to-end neural architectures with full or parameter-efficient fine-tuning. I discuss each of these categories below.
There are many strategies that psychometricians use, and the categories and example papers I discuss here are illustrative rather than exhaustive. My goal for the upcoming sections on scoring is to provide a representative overview of the main methodological approaches rather than a comprehensive survey, with enough detail to understand the options and to work through coded examples.
Generative AI with zero-shot or few-shot prompting
First, we can simply ask an LLM to score via zero-shot or few-shot prompting, as demonstrated by Jiang and Bosch (2024), who used GPT-4 with various prompting strategies to score short answers from the Automated Student Assessment Prize (ASAP) Short Answer Scoring (SAS) dataset (Shermis & Hamner, 2012, 2013), a classic psychometric scoring dataset released by the Hewlett Foundation and hosted on Kaggle.
They achieved a quadratic weighted kappa (QWK) of 0.677 without any model training. This method does not achieve state-of-the-art (SOTA) results but has clear advantages: it requires no training data, is fast to set up, and does not demand deep machine-learning expertise. This makes it useful for quick prototyping or when working with limited resources.
Transparency is variable: you cannot obtain activation-level explanations, but you can ask the LLM for its reasons or use chain-of-thought prompting, and explainable AI (XAI) methods can also be applied. Dynamic retrieval-augmented scoring is also possible, where the most similar scored responses are retrieved at inference time and appended to the prompt (Chu et al., 2025).
This approach can be expensive due to the required API calls unless you can run LLMs locally. We'll show this method in this section on psychometric methods and tools (section 3) because, while it uses rubrics, there is no training involved; it is simply prompt-based scoring. Unlike the embedding or end-to-end approaches, the model weights remain entirely unchanged.
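As a concrete illustration, here is a minimal zero-shot sketch using the OpenAI Python client. The model name, rubric, question, and student answer are placeholders of my own, and an API key is assumed to be available in the environment; this is a sketch of the prompting pattern, not Jiang and Bosch's exact setup.

```python
# Minimal zero-shot scoring sketch; assumes the openai package is installed and
# OPENAI_API_KEY is set. Model name, rubric, and texts are placeholders.
from openai import OpenAI

client = OpenAI()

rubric = (
    "Score the student's answer from 0 to 3 using this rubric:\n"
    "3 = fully correct and complete, 2 = mostly correct, "
    "1 = partially correct, 0 = incorrect or off-topic.\n"
    "Reply with the integer score only."
)
question = "Explain why metals conduct electricity."
student_answer = "Metals have free electrons that can move and carry charge."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,        # deterministic output is preferable for scoring
    messages=[
        {"role": "system", "content": rubric},
        {"role": "user", "content": f"Question: {question}\nAnswer: {student_answer}"},
    ],
)
print(response.choices[0].message.content)  # e.g. "3"
```

For few-shot prompting, a handful of scored example responses would be appended to the prompt; in the dynamic RAG variant (Chu et al., 2025), those examples are instead retrieved at inference time based on their similarity to the response being scored.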
Embeddings as input to other models
Second, we can extract frozen embeddings from pretrained models and use them, along with other potential features (i.e., augmenting the feature set), in classical psychometric models or ensembles to predict human scores. An example of this category is NPCR (Neural Pairwise Contrastive Regression; Xie et al., 2022).
These authors report the current publicly available state of the art for automated essay scoring on the ASAP dataset, using pairwise contrastive regression on frozen BERT embeddings to achieve an average QWK of 0.817. This category also includes methods that score by the similarity between student answer embeddings and ideal answer embeddings (e.g., Mohler & Mihalcea, 2009).
The main advantages of this approach are that it is more interpretable at the feature level and can be cheaper because no API calls are needed when run locally. However, it requires labelled training data and more technical skill to implement well (e.g., knowing how to handle challenges like class imbalance). We will demonstrate these methods in section 4 on AI-psychometric hybrid models, because they take transformer embeddings and feed them into non-neural, conventional statistical analyses.
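To make the recipe concrete, here is a minimal sketch of the general frozen-embeddings approach (not NPCR itself), assuming the sentence-transformers and scikit-learn libraries. The responses, scores, and encoder checkpoint are illustrative placeholders; in practice you would load a labelled dataset such as ASAP-SAS and evaluate on a held-out split.

```python
# Frozen embeddings fed into a classical regressor; toy data for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score

responses = [
    "Plants convert sunlight into chemical energy by photosynthesis.",
    "Photosynthesis is when plants eat dirt.",
    "Light energy is turned into glucose in the chloroplasts.",
    "Plants breathe oxygen at night.",
    "Chlorophyll absorbs light to drive the production of sugars.",
    "The sun makes plants warm.",
]
human_scores = np.array([2, 0, 2, 0, 2, 1])  # rubric scale 0-2

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen encoder, no fine-tuning
X = encoder.encode(responses)                      # one embedding per response

model = Ridge(alpha=1.0).fit(X, human_scores)      # any classical regressor works here

# Round and clip continuous predictions back onto the rubric's integer scale.
preds = np.clip(np.rint(model.predict(X)), 0, 2).astype(int)

# In practice, compute QWK on held-out responses, not the training set.
print("QWK:", cohen_kappa_score(human_scores, preds, weights="quadratic"))
```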
End-to-end neural scoring models
Third, we can use end-to-end neural architectures with trainable weights to predict human scores. This includes encoder-only models with classification heads trained via cross-entropy loss, as well as decoder-only models fine-tuned while preserving their generative capabilities.
For example, Ormerod and Kwako (2024) fine-tuned open-source decoder models (Mistral, Llama-3, Gemma, Phi-3) using QLoRA on the ASAP dataset, achieving QWK scores ranging from 0.76 to 0.79 while requiring only consumer-grade hardware. Do et al. (2024) explored reinforcement-learning-based approaches for automated scoring.
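The following is a minimal configuration sketch in the spirit of a QLoRA setup, not Ormerod and Kwako's exact recipe. It assumes the transformers, peft, and bitsandbytes libraries and a CUDA GPU; the model name, target modules, and hyperparameters are illustrative choices.

```python
# Sketch of preparing a 4-bit quantised decoder model with LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"  # any open decoder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# A standard causal-LM fine-tune on "prompt + response -> score" text would follow.
```

The quantised base weights stay frozen and only the low-rank adapter matrices are trained, which is what makes consumer-grade hardware feasible.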
Some transformer models first leverage pre-training on textual entailment tasks (determining whether one text logically follows from another) and then fine-tune on grading data, which improves their ability to assess semantic overlap between student and reference answers.
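As a sketch of how entailment signals can be used, the snippet below compares a student answer against a reference answer with a publicly available cross-encoder pretrained on NLI data; the checkpoint name and texts are assumptions of mine, and the entailment signal could be used directly, as a feature, or as a starting point for further fine-tuning on graded responses.

```python
# Entailment-style comparison of student and reference answers; assumes the
# sentence-transformers library and a public NLI cross-encoder checkpoint.
from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

reference = "Photosynthesis converts light energy into chemical energy stored in glucose."
student = "Plants use sunlight to make sugar, storing the energy chemically."

# The model returns one logit per NLI class for the (reference, student) pair.
scores = nli_model.predict([(reference, student)])
print(scores)
```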
The key advantage of these techniques is that model components can be jointly optimised. However, these methods require training data and are technically demanding. They have not yet achieved SOTA performance on public leaderboard data, although they may in future, and they may already perform better on private data given how quickly the field is evolving. We'll show this method category in section 5.
Category stacking and ensembles
Lastly, hybrid approaches can combine predictions from the multiple method categories we have discussed. For example, teams in the BEA 2024 shared task achieved strong results by combining hand-crafted features with transformer embeddings or by combining predictions from both regression and neural models (Yaneva et al., 2024). While this competition focused on item difficulty and response time, the methods could also be applied to scoring.
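A minimal stacking sketch follows, assuming scikit-learn; the feature matrices and scores are randomly generated placeholders standing in for frozen embeddings, hand-crafted features, and human ratings, and the choice of base and meta learners is illustrative.

```python
# Feature-level combination plus prediction-level stacking with a meta-learner.
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
embed_features = rng.normal(size=(200, 384))       # e.g. frozen sentence embeddings
handcrafted_features = rng.normal(size=(200, 10))  # e.g. length, readability metrics
y = rng.integers(0, 4, size=200)                   # human scores on a 0-3 scale

X = np.hstack([embed_features, handcrafted_features])  # feature-level combination

# Prediction-level combination: a meta-learner blends two base regressors.
stack = StackingRegressor(
    estimators=[("ridge", Ridge()), ("forest", RandomForestRegressor(n_estimators=100))],
    final_estimator=Ridge(),
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```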
More considerations
Transformers offer powerful methods for scoring free text, but there are important implementation considerations. Metrics like QWK should be read against human inter-rater agreement baselines for context. We also need to validate whether NLP models measure the intended constructs or merely surface features that correlate with scores. Finally, even with frozen embeddings we do not really know what is happening inside the models, so we need to monitor scoring over time to ensure the validity evidence remains acceptable.
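For example, a model-human QWK is easier to interpret alongside the human-human QWK on the same responses. A minimal sketch with scikit-learn and placeholder ratings:

```python
# Compare model-human agreement against the human-human baseline; toy ratings only.
from sklearn.metrics import cohen_kappa_score

rater_1 = [3, 2, 2, 1, 0, 3, 2, 1, 0, 2]
rater_2 = [3, 2, 1, 1, 0, 3, 3, 1, 1, 2]
model   = [3, 2, 2, 1, 1, 3, 2, 1, 0, 1]

human_baseline = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
model_vs_human = cohen_kappa_score(rater_1, model, weights="quadratic")

print(f"human-human QWK: {human_baseline:.2f}")
print(f"model-human QWK: {model_vs_human:.2f}")
```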
References
Chu, Y., He, P., Li, H., Han, H., Yang, K., Xue, Y., ... & Tang, J. (2025). Enhancing LLM-based short answer grading with retrieval-augmented generation. arXiv preprint arXiv:2504.05276.
Do, H., Ryu, S., & Lee, G. (2024, November). Autoregressive multi-trait essay scoring via reinforcement learning with scoring-aware multiple rewards. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 16427-16438).
Hewlett Foundation. (2012). The Hewlett Foundation: Short answer scoring [Data set]. Kaggle. https://www.kaggle.com/c/asap-sas
Jiang, L., & Bosch, N. (2024). Short answer scoring with GPT-4. In Proceedings of the Eleventh ACM Conference on Learning @ Scale (pp. 438-442). Association for Computing Machinery.
Mohler, M., & Mihalcea, R. (2009, March). Text-to-text semantic similarity for automatic short answer grading. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) (pp. 567-575).
Ormerod, C., & Kwako, A. (2024). Automated text scoring in the age of generative AI for the GPU-poor. arXiv preprint arXiv:2407.01873.
Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Paper presented at the annual meeting of the National Council on Measurement in Education.
Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions. Routledge.
Xie, J., Cai, K., Kong, L., Zhou, J., & Qu, W. (2022). Automated essay scoring via pairwise contrastive regression. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 2724-2733). International Committee on Computational Linguistics.
Yaneva, V., North, K., Baldwin, P., Ha, L. A., Rezayi, S., Zhou, Y., Ray Choudhury, S., Harik, P., & Clauser, B. (2024). Findings from the first shared task on automated prediction of difficulty and response time for multiple-choice questions. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 470-482). Association for Computational Linguistics.