SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

📅 2025-05-12
🤖 AI Summary
Problem: Existing subjective short-answer scoring (SAS) methods typically yield coarse-grained scores that lack interpretability, while LLM-based zero-shot evaluation suffers from bias, low agreement with human graders, and opaque decision-making. Method: We propose a fine-grained, stepwise SAS evaluation framework and introduce SAS-Bench, the first open-source, education-driven benchmark, comprising 1,030 authentic test items from multiple disciplines, 4,109 student responses, and expert-annotated error types and stepwise scores. Our approach combines an error taxonomy defined by domain experts with educational measurement principles, bias diagnostics, and explainability analysis. Contribution/Results: Experiments show that scoring science-related questions remains a major challenge for LLMs and that few-shot prompting improves scoring accuracy. SAS-Bench provides a reproducible and auditable basis for evaluating LLMs in educational assessment, supporting both validity and transparency in automated short-answer scoring.

📝 Abstract
Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.
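To make the idea of step-wise scoring with error categories concrete, the following is a minimal sketch of how one annotated response might be represented. The class names, field names, and the error label `calculation_error` are hypothetical illustrations and are not taken from the SAS-Bench dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredStep:
    """One step of a student response with its score and error labels (hypothetical schema)."""
    text: str
    score: float
    errors: list[str] = field(default_factory=list)


@dataclass
class AnnotatedResponse:
    """A student response broken into individually scored steps (hypothetical schema)."""
    question_id: str
    steps: list[ScoredStep]

    def total_score(self) -> float:
        # The overall score is assumed here to be the sum of the step scores.
        return sum(step.score for step in self.steps)

    def error_counts(self) -> dict[str, int]:
        # Count how often each error label occurs across all steps.
        counts: dict[str, int] = {}
        for step in self.steps:
            for label in step.errors:
                counts[label] = counts.get(label, 0) + 1
        return counts


# Illustrative data only; not drawn from the benchmark.
example = AnnotatedResponse(
    question_id="q-001",
    steps=[
        ScoredStep("Sets up the equation correctly.", 2.0),
        ScoredStep("Makes an arithmetic slip.", 0.0, ["calculation_error"]),
    ],
)
```

Keeping scores and error labels at the step level is what allows a model's reasoning to be compared with the expert annotation step by step rather than only by final score.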
Problem

Research questions and friction points this paper is trying to address.

Existing SAG methods lack fine-grained scoring and reasoning details
LLMs show bias and inconsistency in short answer scoring
Need for robust benchmarks to evaluate LLM-based SAS tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAS-Bench enables fine-grained step-wise scoring
Includes expert-annotated error categories for transparency
Uses few-shot prompting to improve scoring accuracy
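As a rough illustration of few-shot prompting for scoring, the sketch below places worked scoring examples ahead of the response to be graded. The function name, prompt wording, and example data are assumptions made for illustration; the prompt templates actually used in the paper are not given in this summary.

```python
def build_few_shot_prompt(instruction, examples, question, response):
    """Assemble a plain-text scoring prompt with worked examples first (hypothetical format)."""
    lines = [instruction.strip(), ""]
    for index, example in enumerate(examples, start=1):
        lines.extend([
            f"Example {index}",
            f"Question: {example['question']}",
            f"Student response: {example['response']}",
            f"Step-wise scoring: {example['scoring']}",
            "",
        ])
    # The target item comes last, with the scoring left open for the model to complete.
    lines.extend([
        "Now score this response in the same format.",
        f"Question: {question}",
        f"Student response: {response}",
        "Step-wise scoring:",
    ])
    return "\n".join(lines)


# Illustrative data only; not drawn from the benchmark.
demo_examples = [
    {
        "question": "What is 3 + 4 * 2?",
        "response": "3 + 4 = 7, then 7 * 2 = 14.",
        "scoring": "step 1: 0 (order-of-operations error); step 2: 0",
    },
]
demo_prompt = build_few_shot_prompt(
    "You are grading short answers step by step.",
    demo_examples,
    "What is 10 - 2 * 3?",
    "2 * 3 = 6, then 10 - 6 = 4.",
)
```

The resulting string can be sent to any LLM; a zero-shot baseline corresponds to calling the same function with an empty `examples` list.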
Authors

Peichao Lai
Peking University
Kexuan Zhang
Fuzhou University
Yi Lin
Hunan University
Linyihan Zhang
Fuzhou University
Feiyang Ye
University of Technology Sydney, Ph.D. student
Multi-Task Learning
Jinhao Yan
Fuzhou University
Yanwei Xu
Peking University
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Yilei Wang
Alibaba Cloud
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, htsc, time-resolved
Bin Cui
Peking University