When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

📅 2026-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the unreliability of large language models (LLMs) in automated grading and the difficulty of knowing when their predictions can be trusted. The authors propose a confidence-based selective automation framework that accepts high-confidence predictions automatically and routes low-confidence ones to human reviewers. They systematically evaluate three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs ranging from 4B to 120B parameters and three educational datasets (RiceChem, SciEntsBank, and Beetle). Self-reported confidence is the best calibrated (average ECE = 0.166 vs. 0.229 for self-consistency voting), even though self-consistency requires 5× the inference cost. GPT-OSS-120B achieves the best calibration (average ECE = 0.100) with an average AUC of 0.668. The study also finds that calibration gains from model scale vary by dataset and method, and that confidence scores are strongly skewed toward the top of the range across methods.
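
The accept-or-route rule described in this summary can be sketched as a simple threshold on the grader's confidence. This is a minimal illustration, not the authors' implementation: the function names, the 0.9 default threshold, and the toy data are all assumptions made for the example.

```python
def route_by_confidence(confidences, threshold=0.9):
    """Split prediction indices into auto-accepted and human-review groups.

    `threshold` is a hypothetical value; in practice it would be tuned
    on held-out graded data.
    """
    auto, review = [], []
    for i, conf in enumerate(confidences):
        (auto if conf >= threshold else review).append(i)
    return auto, review


def coverage_and_accuracy(confidences, correct, threshold=0.9):
    """Return (share of items auto-accepted, accuracy on those items).

    Accuracy is None when nothing clears the threshold.
    """
    auto, _ = route_by_confidence(confidences, threshold)
    coverage = len(auto) / len(confidences)
    accuracy = sum(1 for i in auto if correct[i]) / len(auto) if auto else None
    return coverage, accuracy


# Toy data (made up for illustration): four graded answers.
confidences = [0.95, 0.60, 0.99, 0.80]
correct = [True, False, True, True]
auto, review = route_by_confidence(confidences)
coverage, accuracy = coverage_and_accuracy(confidences, correct)
```

Raising the threshold trades coverage for accuracy on the auto-accepted subset, which is the trade-off a selective automation setup has to tune.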

📝 Abstract

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a "confidence floor" that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at https://github.com/sonkar-lab/llm_grading_calibration.
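
To make the calibration metric concrete, the sketch below computes Expected Calibration Error (ECE) using the common equal-width-bin definition, and derives a self-consistency confidence as the share of sampled grades that agree with the majority. Both are assumptions for illustration: the abstract does not state the binning scheme or the exact voting rule, and the function names and toy inputs are hypothetical.

```python
from collections import Counter


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |avg confidence - accuracy| over equal-width bins.

    Assumes confidences lie in [0, 1]; n_bins=10 is a conventional default,
    not a value taken from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def vote_confidence(sampled_labels):
    """Majority label and the fraction of samples that agree with it."""
    label, votes = Counter(sampled_labels).most_common(1)[0]
    return label, votes / len(sampled_labels)


# Toy data (made up): the grader claims 0.9 confidence but is right 3 times in 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
label, conf = vote_confidence(["correct"] * 4 + ["incorrect"])
```

With five samples per answer, as in the vote example, the agreement share can only take a few discrete values and the majority share never falls below 0.2, so the confidence is coarse by construction. A lower ECE means stated confidence tracks observed accuracy more closely.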

Problem

Research questions and friction points this paper is trying to address.

LLM grading

confidence calibration

automated assessment

reliability prediction

educational evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration

LLM grading

selective automation