AI Summary
This study addresses the challenge of diminished scoring consistency in automated short-answer assessment, particularly for responses of medium quality, where ambiguous model judgments can compromise evaluation fairness. We systematically investigate the scoring consistency of large language models (GPT-5.2, GPT-4o, and Claude Opus 4.5) under few-shot settings across answers of varying quality, introducing a quality-conditioned fairness evaluation perspective. Comparisons with a fine-tuned BERT encoder and human expert ratings reveal that human scorers exhibit high stability, while AI models perform reliably on extreme-quality responses but suffer significant degradation on medium-quality ones. This degradation is mitigated with increased task-adaptation data, with the fine-tuned model achieving the best performance. Our work is the first to identify and quantitatively characterize the consistency degradation phenomenon specifically induced by medium-quality responses.
Abstract
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, the impact of limited task-specific data on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: it is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
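To make "quality-conditioned scoring agreement" concrete, the following is a minimal sketch that groups responses by their ground truth score and computes agreement within each group. The abstract does not say which agreement statistic the study uses, so exact-match agreement is an assumption here, as are the function name `agreement_by_quality`, the 0-2 score scale, and the toy data.

```python
from collections import defaultdict


def agreement_by_quality(gold, predicted):
    """Exact-match agreement between gold and predicted scores, per gold score.

    Assumption: exact match is used purely for illustration; the study's
    actual agreement metric is not given in the abstract.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    # Fraction of responses in each gold-score band that the scorer matched.
    return {score: hits[score] / totals[score] for score in sorted(totals)}


# Made-up toy data on a hypothetical scale:
# 0 = incorrect, 1 = partially correct, 2 = fully correct.
gold = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 0, 0, 1, 2, 1, 2, 2, 2]

print(agreement_by_quality(gold, predicted))
```

Reporting agreement per band rather than as one pooled number is what would expose a mid-range drop. In the toy data the scorer matches every extreme response but only half of the partially correct ones, a gap that a single pooled figure of 0.8 would hide.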