Calibrating LLMs with Semantic-level Reward

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses unreliable uncertainty estimation in large language models (LLMs) deployed in high-stakes domains such as healthcare and law. Reinforcement learning approaches that rely on binary correctness rewards do not penalize confident but wrong answers, so they struggle to calibrate model confidence. The paper introduces Calibration with Semantic Reward (CSR), a framework that defines a calibration-aware reward in semantic space by jointly modeling correctness and semantic consistency. CSR optimizes calibration directly, without explicit verbalized confidence outputs, and so avoids the inconsistencies that arise when responses with the same meaning differ in surface form. Experiments on HotpotQA, TriviaQA, MSMARCO, and NQ-Open show that CSR outperforms verbalized-confidence baselines in nearly all settings, with up to a 40% reduction in Expected Calibration Error (ECE) and up to a 31% improvement in AUROC, and that its calibration generalizes to out-of-distribution benchmarks.
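For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how ECE and AUROC are commonly computed from per-example confidence scores and correctness labels. The function names, the equal-width binning, and the default of 10 bins are illustrative assumptions; the paper's exact evaluation protocol is not stated on this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin.

    Assumption: confidences lie in [0, 1]; correct holds 0/1 labels.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Ties count as one half. Quadratic in the number of examples, which is
    fine for a sketch but slow for large evaluation sets.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Lower ECE means stated confidence tracks empirical accuracy more closely; higher AUROC means confidence separates correct from incorrect answers more cleanly. The two can move independently, which is presumably why both are reported.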
📝 Abstract
As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, and this requires well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence: it imposes no penalty for confident but wrong predictions and thereby degrades calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and is therefore inconsistent across textual variations that share the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR achieves lower ECE and higher AUROC than verbalized-confidence baselines in nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
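To make the reward structure described in the abstract concrete, the following is a hypothetical sketch of one way a correctness reward could be combined with a semantic agreement term over a group of rollouts. It is not the paper's formula, which is not given on this page. The function names, the `weight` parameter, and the use of normalized string matching as a stand-in for semantic clustering are all assumptions made for illustration.

```python
import string
from collections import Counter


def normalize(answer):
    """Crude stand-in for semantic clustering (an assumption, not the paper's method).

    Lowercases, strips punctuation, and drops English articles so that
    surface variants such as "Paris" and "the Paris." share one key.
    """
    text = answer.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)


def semantic_rewards(rollouts, gold, weight=0.5):
    """Hypothetical reward: correctness plus a signed semantic agreement term.

    agreement = fraction of rollouts in the same (approximate) semantic
    cluster as this rollout. Correct rollouts gain from agreement, which
    rewards consistency; incorrect rollouts lose from it, which penalizes
    consistent but wrong answers.
    """
    if not rollouts:
        return []
    keys = [normalize(r) for r in rollouts]
    counts = Counter(keys)
    gold_key = normalize(gold)
    rewards = []
    for key in keys:
        agreement = counts[key] / len(keys)
        is_correct = key == gold_key
        bonus = agreement if is_correct else -agreement
        rewards.append((1.0 if is_correct else 0.0) + weight * bonus)
    return rewards


# Three rollouts agree on the correct answer; one is a lone wrong answer.
example = semantic_rewards(["Paris", "paris.", "The Paris", "Lyon"], gold="Paris")
```

Under these assumptions, the three agreeing correct rollouts each receive 1.375 and the lone incorrect rollout receives -0.125. Had three rollouts agreed on a wrong answer instead, each would be penalized more heavily than the lone wrong one, which is the "discouraging spurious consistency" behavior the abstract describes. A real implementation would likely replace `normalize` with a learned or entailment-based equivalence check.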
Problem

Research questions and friction points this paper is trying to address.

calibration
large language models
uncertainty estimation
semantic consistency
confidence scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic calibration
uncertainty calibration
large language models
reinforcement learning
semantic reward