Confidence Estimation in Automatic Short Answer Grading with LLMs

📅 2026-04-30

🤖 AI Summary
This study addresses the lack of reliable confidence estimation when large language models (LLMs) are used for automatic short answer grading, a limitation that undermines safe and effective human-AI collaboration in educational assessment. The authors compare three model-based confidence signals (verbalized, latent, and consistency-based) and find that model-based confidence alone does not reliably capture uncertainty. They therefore propose a hybrid confidence framework that combines these signals with an estimate of dataset-derived aleatoric uncertainty, obtained by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. According to the reported results, the hybrid measure yields more reliable confidence estimates and better selective grading performance than single-source approaches, supporting more trustworthy human-in-the-loop grading systems.
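To make the idea of selective grading concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each graded response has a confidence score and a flag saying whether the LLM grade matched the human grade. The most confident fraction of responses is kept for automatic grading and the rest is deferred to a human. The function name `selective_accuracy` and the toy numbers are hypothetical.

```python
import math


def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of responses.

    Hypothetical helper for illustration: responses outside the kept
    fraction are assumed to be deferred to a human grader.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Rank responses from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda t: t[0], reverse=True)
    # Number of responses graded automatically (at least one).
    k = max(1, math.ceil(coverage * len(ranked)))
    kept = ranked[:k]
    return sum(1 for _, ok in kept if ok) / k


# Toy data (made up): four responses, one graded incorrectly by the model.
confidences = [0.95, 0.90, 0.60, 0.55]
correct = [True, True, False, True]

half = selective_accuracy(confidences, correct, coverage=0.5)
full = selective_accuracy(confidences, correct, coverage=1.0)
```

A well-behaved confidence measure should make accuracy rise as coverage shrinks, which is the property the selective grading comparison in the summary refers to.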
📝 Abstract
Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.
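The abstract does not say how within-cluster heterogeneity is measured or how the two sources are combined, so the sketch below rests on stated assumptions rather than the paper's method. It assumes cluster assignments are already available (for example from clustering sentence embeddings), measures heterogeneity as the normalized Shannon entropy of reference grades within each cluster, and combines it with a model-based confidence through a simple weighted average. The names `cluster_heterogeneity`, `hybrid_confidence`, and the `weight` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_heterogeneity(cluster_ids, grades):
    """Normalized grade entropy per cluster (0 = uniform grades, 1 = maximally mixed).

    Assumption: heterogeneity is measured over reference grades of the
    responses in each cluster; the paper may define it differently.
    """
    n_classes = len(set(grades))
    by_cluster = defaultdict(list)
    for cid, grade in zip(cluster_ids, grades):
        by_cluster[cid].append(grade)

    result = {}
    for cid, members in by_cluster.items():
        total = len(members)
        probs = [count / total for count in Counter(members).values()]
        entropy = sum(p * math.log(1.0 / p) for p in probs)
        # Divide by the maximum possible entropy so values lie in [0, 1].
        result[cid] = entropy / math.log(n_classes) if n_classes > 1 else 0.0
    return result


def hybrid_confidence(model_conf, aleatoric, weight=0.5):
    """Weighted average of model confidence and (1 - aleatoric uncertainty).

    Assumption: a linear combination is used purely for illustration.
    """
    return weight * model_conf + (1.0 - weight) * (1.0 - aleatoric)


# Toy data (made up): cluster 0 is graded uniformly, cluster 1 is evenly split.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
grades = ["correct", "correct", "correct",
          "correct", "incorrect", "correct", "incorrect"]

heterogeneity = cluster_heterogeneity(cluster_ids, grades)
conf_clean = hybrid_confidence(0.9, heterogeneity[0])
conf_mixed = hybrid_confidence(0.9, heterogeneity[1])
```

With the same model confidence, a response that falls into the mixed cluster receives a lower hybrid confidence than one in the uniformly graded cluster, which is the intuition behind adding a dataset-derived signal.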
Problem

Research questions and friction points this paper is trying to address.

Automatic Short Answer Grading
Confidence Estimation
Large Language Models
Aleatoric Uncertainty
Human-AI Collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation
large language models
automatic short answer grading
aleatoric uncertainty
human-in-the-loop assessment
Longwei Cong
DIPF | Leibniz Institute for Research and Information in Education, 60323 Frankfurt am Main, Germany
Sonja Hahn
DIPF | Leibniz Institute for Research and Information in Education, 60323 Frankfurt am Main, Germany
Sebastian Gombert
DIPF | Leibniz Institute for Research and Information in Education, 60323 Frankfurt am Main, Germany
Leon Camus
DIPF | Leibniz Institute for Research and Information in Education, 60323 Frankfurt am Main, Germany
Hendrik Drachsler
Professor for Computer Science, DIPF | Leibniz Institute & Goethe University, Frankfurt
Learning Analytics, AI in Education, Assessment and Feedback, Learning Design, Medical Education
Ulf Kroehne
DIPF | Leibniz Institute for Research and Information in Education, 60323 Frankfurt am Main, Germany; Chemnitz University of Technology, 09111 Chemnitz, Germany