S-GRADES: Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding fragmentation in automated scoring research, which has treated long-form and short-answer tasks in isolation, lacking a unified evaluation framework. To bridge this gap, we propose the first open-source, extensible web-based benchmark platform that integrates 14 diverse scoring datasets, establishing a standardized evaluation protocol across paradigms and datasets while supporting continuous integration of new data and assessment methodologies. Leveraging large language models (LLMs), we systematically evaluate various prompting strategies, example selection techniques, and cross-dataset transfer approaches. Our experiments reveal significant disparities in the reliability and generalization capabilities of current LLMs across different scoring tasks, underscoring the critical role of unified evaluation in advancing educational natural language processing.

📝 Abstract
Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple prompt-based reasoning strategies. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.
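The abstract describes the evaluation recipe only at a high level (few-shot prompting with selected exemplars, scored against human labels). The sketch below illustrates what one step of such a pipeline might look like. The prompt wording, the call_llm stub, the item fields, and the use of quadratic weighted kappa as the agreement metric are assumptions for illustration, not the paper's actual protocol or the S-GRADES API.

```python
"""Hypothetical sketch of few-shot LLM grading with exemplar selection.

Assumptions (not from the paper): prompt template, call_llm() stub,
item schema, and quadratic weighted kappa (QWK) as the agreement metric.
"""
from sklearn.metrics import cohen_kappa_score


def build_prompt(question, reference, exemplars, response, max_score=3):
    """Assemble a few-shot grading prompt from selected exemplars."""
    lines = [
        f"Grade the student answer on a 0-{max_score} scale.",
        f"Question: {question}",
        f"Reference answer: {reference}",
        "",
    ]
    for ex_response, ex_score in exemplars:  # in-context exemplars
        lines += [f"Student answer: {ex_response}", f"Score: {ex_score}", ""]
    lines += [f"Student answer: {response}", "Score:"]
    return "\n".join(lines)


def call_llm(prompt):
    """Placeholder for a real LLM call (e.g. an API client); returns the model's text output."""
    raise NotImplementedError


def grade_dataset(items, exemplars):
    """Score every item with the LLM and report QWK against human labels."""
    gold, pred = [], []
    for item in items:
        prompt = build_prompt(item["question"], item["reference"], exemplars, item["response"])
        raw = call_llm(prompt)
        pred.append(int(raw.strip().split()[0]))  # naive parse of the first token as the score
        gold.append(item["score"])
    return cohen_kappa_score(gold, pred, weights="quadratic")
```

Swapping which exemplars are passed in (same-dataset vs. another dataset) would correspond, under these assumptions, to the exemplar-selection and cross-dataset transfer settings the abstract mentions.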
Problem

Research questions and friction points this paper is trying to address.

Automated Essay Scoring
Automatic Short Answer Grading
generalization
educational NLP
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmark
generalization
automated grading
educational NLP
large language models