In this first approach to transformer scoring of constructed responses, we carry out zero-shot and few-shot scoring and compare the results to human scores of the same constructed responses. I will create and share a scoring implementation for zero-shot and few-shot LLM scoring that replicates a section of a recent paper by Jiang & Bosch (2024).
In particular, we will use the OpenAI API to score the free-text responses in a classic education data set from the Automated Student Assessment Prize (ASAP) short answer scoring competition. This data set was released by the Hewlett Foundation on Kaggle in 2012 and has become a benchmark data set in the field.
While the study reported by Jiang & Bosch did not achieve state-of-the-art (SOTA) performance in score prediction on this data set, the study is nonetheless important for several reasons. The first reason is that the performance of these models depends on the quality of the rubrics, and the rubrics supplied with this data set are of variable quality.
Given the moderate quality of the rubrics, it is very possible that others are attaining higher performance in automated scoring using better-developed rubrics. In I-O psychology assessment, for example, the quality of the indicators in well-designed assessment processes is usually higher. The second reason is that this method does not need any training data, unlike methods in the other constructed-response scoring categories we will explore.
Accessing the Kaggle data and our code
If you want to follow along, you'll need to visit Kaggle to get the data, as the Kaggle competition rules state that the data can't be shared elsewhere. To get a copy of the data for yourself, visit the now-closed competition, accept the competition rules and request a Kaggle token. You will need to use the token with the download syntax provided on the Kaggle site to download the zip from the command line on your local machine; you can then unzip it to access the data files.
I chose to try to replicate the results for essay set 10, which is a grade 8 science question. This set has 1,640 scored responses for training and a hold-out sample of 548 responses that were originally unscored, as contestants needed to submit their predicted scores. Because zero-shot and few-shot LLM scoring involve no training, we focus here simply on LLM scoring of a sample of the responses in the original file. We score 100 examples, as the original study did. Experimentation costs add up quickly because the rubric and scored examples are sent with each API call; completing the full experiment, including trial and error, cost about US$40.
My code for the replication and the original code of Jiang and Bosch (2024) are both available on GitHub. The strategy I followed to code my version was to read the paper once to reach a reasonable level of understanding, look at the data files from Kaggle to understand the variable names and file structures, and then use AI to assist with coding, given that what we are trying to do is relatively straightforward.
This process turned out to be effective for achieving the baseline performance without scored examples, but the approach was more temperamental at recovering the accuracy boost from including scored examples. After swapping out my scored examples for the exact scored examples Jiang & Bosch provided, keeping my own code, and trying a few random seeds for the sampling, I observed the boost they described (albeit not as strong). The authors themselves are cautious in their claims about the increased accuracy from examples.
Overall, I conclude that the effects they reported are reliable for this question as they are broadly replicable using different code on a different subsample of the data.
Process and code
Here the code used for the replication is presented in its core stages. The first code block imports the required libraries, initializes the OpenAI client using an API key from the environment, verifies that the key is set, and confirms successful setup.
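In sketch form, and assuming the key is stored in an OPENAI_API_KEY environment variable and the v1 openai Python client is installed, this block looks roughly like:

import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it in the script
api_key = os.environ.get("OPENAI_API_KEY")
assert api_key, "Set the OPENAI_API_KEY environment variable before running."

client = OpenAI(api_key=api_key)
print("✓ OpenAI client initialized!")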
The next code block loads the TSV data set into a pandas DataFrame and prints its size, column names, and first row so we can inspect the data structure.
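A sketch of that loading step, assuming the training file extracted from the Kaggle zip is named train.tsv and sits in the working directory:

import pandas as pd

# File name and encoding are assumptions based on the ASAP short-answer download;
# adjust them to match the files you extracted from the Kaggle zip.
df = pd.read_csv("train.tsv", sep="\t", encoding="latin-1")

print(f"{len(df)} rows, columns: {list(df.columns)}")
print(df.iloc[0])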
This code block filters the data set to Question 10 only, summarizes its score distribution, and prints one sample response for each score level. Note that Jiang and Bosch's code ran across all the question sets, so it used sub-folder structuring to call the data sets; that is not needed here because we study only question 10.
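Roughly, and assuming the ASAP short-answer column names (EssaySet, Score1, EssayText), that filtering step looks like:

# Column names below are assumptions based on the ASAP short-answer file layout;
# swap them if your copy of the data uses different headers.
q10 = df[df["EssaySet"] == 10].copy()

print("Question 10 score distribution:")
print(q10["Score1"].value_counts().sort_index())

# One sample response for each score level
for score, group in q10.groupby("Score1"):
    print(f"\n--- Example response scored {score} ---")
    print(group["EssayText"].iloc[0])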
This code block stores the Question 10 prompt, rubric, and metadata as variables and prints a confirmation that the information was loaded. We use placeholders for the question text, which is available in the Kaggle data set.
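In sketch form, with placeholder strings standing in for the real Kaggle text and variable names of my own choosing:

# Placeholders only: the actual question text and rubric come from the Kaggle
# materials, which cannot be redistributed here.
QUESTION_TEXT = "<Question 10 prompt text from the Kaggle data set>"
RUBRIC = "<Question 10 scoring rubric describing what earns a 0, 1 or 2>"
QUESTION_META = {"essay_set": 10, "subject": "Science", "grade": 8, "score_range": (0, 2)}

print("✓ Question 10 prompt, rubric and metadata loaded!")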
This code defines (but does not yet run) a function that will later be called to send a prompt to GPT-4 and return the model’s predicted score.
def score_with_gpt4(prompt):
    """Call GPT-4 to score a response and return the raw reply text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring
        max_tokens=10     # we only need a short numeric label back
    )
    return response.choices[0].message.content

print("✓ GPT-4 scoring function created!")

The next code block randomly stratifies 100 Q10 responses by score distribution, scores them with GPT-4 by calling the function above, computes accuracy and quadratic weighted kappa, and saves the results for later analysis. This gets us to the results for part one of the approach, without scored examples.
Now we need to recreate the prompt but give it scored examples for the comparison. Here I initially did not have success in getting the accuracy increase until I tried two things in combination (so it is not clear which drove the increase): adding the precise scored examples used by the authors and trying a different random seed.
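A sketch of how the few-shot prompt might be assembled, assuming the question and rubric variables defined earlier; the placeholder examples below are hypothetical, and in the replication they were replaced with the exact scored examples published by Jiang & Bosch:

# Hypothetical placeholder examples; substitute real scored responses here.
FEW_SHOT_EXAMPLES = [
    {"response": "<a response the human raters scored 0>", "score": 0},
    {"response": "<a response the human raters scored 1>", "score": 1},
    {"response": "<a response the human raters scored 2>", "score": 2},
]

def build_few_shot_prompt(student_response):
    """Assemble a prompt containing the question, rubric and scored examples."""
    example_block = "\n\n".join(
        f"Example response: {ex['response']}\nScore: {ex['score']}"
        for ex in FEW_SHOT_EXAMPLES
    )
    return (
        f"Question: {QUESTION_TEXT}\n\n"
        f"Rubric: {RUBRIC}\n\n"
        f"Scored examples:\n\n{example_block}\n\n"
        f"Student response: {student_response}\n\n"
        "Assign a score of 0, 1 or 2. Reply with the number only."
    )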
Finally, we run the experiment again, using exactly the same sample as in part 1, this time with the scored examples included in the prompt.
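In sketch form, that re-run reuses the sample and true_scores from the zero-shot block together with the few-shot prompt builder above:

# Re-score exactly the same 100 responses, this time with examples in the prompt
predictions_fs = []
for _, row in sample.iterrows():
    reply = score_with_gpt4(build_few_shot_prompt(row["EssayText"]))
    digit = re.search(r"\d", reply)
    predictions_fs.append(int(digit.group()) if digit else -1)

print(f"Few-shot accuracy: {accuracy_score(true_scores, predictions_fs):.3f}")
print(f"Few-shot QWK: {cohen_kappa_score(true_scores, predictions_fs, weights='quadratic'):.3f}")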
Results
Panel A shows model accuracy improves from 0.710 (no examples) to 0.730 (with examples), showing a small performance gain from adding examples.
Panel B shows Quadratic Weighted Kappa increases from 0.733 to 0.750 with examples, indicating better agreement with human graders.
Panel C shows when examples are included in the prompt the model makes fewer big scoring errors (e.g., 0 when the true score is 2) even though small errors still occur at similar rates.
Panel D shows correct predictions dominate the diagonal in the confusion matrix without examples, but the model frequently confuses adjacent scores.
Panel E shows that with examples, the diagonal accuracy increases slightly, especially for scores 1 and 2, indicating improved classification with examples.
Panel F is a score-shift plot that shows most responses retain the same score across conditions (largest bubbles on the diagonal), with some shifts when examples are added.
References
Jiang, L., & Bosch, N. (2024, July). Short answer scoring with GPT-4. In Proceedings of the Eleventh ACM Conference on Learning@ Scale (pp. 438-442).
Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. In: Paper presented at the National Council of Measurement in Education.
Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In: M. D. Shermis & J. Burstein (Eds.).
The Learning Agency Lab. (n.d.). Automated Student Assessment Prize (ASAP) dataset. Kaggle.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).