🤖 AI Summary
This work addresses the unreliability of large language models (LLMs) in automated grading and the difficulty of judging when their predictions can be trusted. The authors propose a confidence-based selective automation framework that accepts high-confidence predictions automatically and routes low-confidence ones to human reviewers. They systematically evaluate three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs ranging from 4B to 120B parameters and three educational datasets (RiceChem, SciEntsBank, and Beetle). Self-reported confidence is the best calibrated of the three (average ECE = 0.166 vs. 0.229 for self-consistency, which also requires 5× the inference cost); GPT-OSS-120B achieves the best calibration (average ECE = 0.100, average AUC = 0.668). The study also finds that calibration gains from model scale vary by dataset and method, and that confidence scores are strongly skewed toward the top of the range across methods.
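The selective automation idea summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `route_predictions` and `selective_accuracy`, the threshold value, and the toy data are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def route_predictions(
    confidences: Sequence[float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Split prediction indices into auto-accepted and human-review groups.

    Hypothetical helper: a prediction is auto-accepted when its confidence
    is at least `threshold`, otherwise it is flagged for human review.
    """
    auto, review = [], []
    for i, conf in enumerate(confidences):
        (auto if conf >= threshold else review).append(i)
    return auto, review


def selective_accuracy(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy on the auto-accepted subset).

    Coverage is the fraction of predictions handled without a human.
    Accuracy is None when nothing clears the threshold.
    """
    auto, _ = route_predictions(confidences, threshold)
    coverage = len(auto) / len(confidences)
    accuracy = sum(correct[i] for i in auto) / len(auto) if auto else None
    return coverage, accuracy


# Toy data (made up for illustration only).
confidences = [0.95, 0.60, 0.90, 0.40]
correct = [True, False, True, True]

auto, review = route_predictions(confidences, threshold=0.8)
coverage, accuracy = selective_accuracy(confidences, correct, threshold=0.8)
```

Raising the threshold trades coverage for accuracy on the automated subset. If confidence scores cluster near the top of the range, a threshold below that cluster routes almost nothing to review, so the threshold has to be chosen from the observed score distribution.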
📝 Abstract
Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: \textit{predicting when an LLM grader is likely to be correct}. This enables selective automation, where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency has a 38\% higher ECE despite requiring 5$\times$ the inference cost. Larger models exhibit substantially better calibration, though gains vary by dataset and method (e.g., a 28\% ECE reduction for self-reported confidence), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a ``confidence floor'' that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available \href{https://github.com/sonkar-lab/llm_grading_calibration}{here}.
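For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of Expected Calibration Error (ECE) in its common binned form. The abstract does not state the binning used in the paper, so the 10 equal-width bins, the function name `expected_calibration_error`, and the toy inputs are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    Assumes confidences lie in [0, 1] and uses equal-width bins; this is a
    standard formulation, not necessarily the exact one used in the paper.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
ece_toy = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])

# Relative gap between the two average ECE values reported in the abstract.
relative_gap = (0.229 - 0.166) / 0.166
```

In the toy example, stated confidence (0.9) exceeds observed accuracy (0.75), so the ECE is their gap. The `relative_gap` line shows that the 38\% figure in the abstract matches the ratio of the two reported averages.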