Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

📅 2025-09-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Automated essay scoring (AES) systems face limited adoption in high-stakes assessment because they provide little score-confidence estimation or interpretability. Method: This paper proposes a novel uncertainty calibration framework that integrates conformal prediction with uncertainty-aware accuracy (UAcc), enabling set-valued predictions with guaranteed coverage for AES, the first such application in the domain. We fine-tune Llama-3-8B and Qwen-2.5-3B on multi-source datasets (ASAP, TOEFL11, Cambridge-FCE) under this framework. Contribution/Results: At a 90% risk level, our method strictly satisfies the target coverage guarantee while yielding compact prediction sets and significantly improved UAcc. It facilitates teacher-in-the-loop evaluation and empirically validates the viability of medium-scale open-weight LLMs for trustworthy, interpretable AES. Our work establishes a new paradigm for deploying education-focused AI systems grounded in statistical reliability and pedagogical transparency.

๐Ÿ“ Abstract
Automated Essay Scoring (AES) systems now reach near-human agreement on some public benchmarks, yet real-world adoption, especially in high-stakes examinations, remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs and formal coverage guarantees. Two open-source large language models (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90 percent risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
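The set-valued outputs with coverage guarantees described in the abstract follow the standard split conformal recipe for classification. A minimal sketch, assuming the fine-tuned model emits softmax probabilities over discrete score labels (function names here are illustrative, not the paper's code):

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration.

    Nonconformity score = 1 - softmax probability of the true score label,
    computed on a held-out calibration set.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level that guarantees
    # >= 1 - alpha marginal coverage (alpha=0.1 gives the paper's 90%).
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_set(probs, qhat):
    """Return every score label whose nonconformity is within the threshold."""
    return np.where(1.0 - probs <= qhat)[0]
```

Confident essays yield singleton sets; ambiguous ones widen, which is exactly the signal a teacher-in-the-loop workflow can route for human review.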
Problem

Research questions and friction points this paper is trying to address.

Addressing lack of confidence measures in automated essay scoring
Applying conformal prediction for uncertainty calibration in LLMs
Evaluating reliability through uncertainty-aware accuracy metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal prediction for set-valued outputs
Fine-tuned open-source LLMs on diverse corpora
Uncertainty-aware accuracy metric UAcc
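The exact UAcc formula is not stated on this page. One common form in the LLM uncertainty-quantification literature scales set-coverage accuracy by the square root of the label-space size and divides by the average prediction-set size, so a model is rewarded for sets that are both correct and small. A hedged sketch under that assumption:

```python
import numpy as np

def uacc(pred_sets, labels, n_classes):
    """Illustrative uncertainty-aware accuracy (assumed form, not
    necessarily the paper's exact definition).

    hits:     fraction of essays whose true score lies in the prediction set
    avg_size: mean prediction-set size (conciseness penalty)
    """
    hits = np.mean([y in s for s, y in zip(pred_sets, labels)])
    avg_size = np.mean([len(s) for s in pred_sets])
    return hits * np.sqrt(n_classes) / avg_size
```

Under this form, a perfectly covered but always-singleton scorer on a 4-label rubric reaches UAcc of 2.0, while padding every set with extra labels drags the value down even when coverage is unchanged.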