Process Supervision of Confidence Margin for Calibrated LLM Reasoning

📅 2026-04-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large language models often become overconfident during reasoning because outcome-oriented rewards do not penalize unwarranted confidence, which leads to hallucinations, miscalibrated confidence, and wasted compute. This work proposes RLCM (Reinforcement Learning with Confidence Margin), a framework that introduces process supervision within individual reasoning trajectories by augmenting the reward with a confidence margin term. Rather than directly aligning model confidence with correctness likelihoods, RLCM widens the confidence gap between correct and incorrect intermediate steps. This jointly optimizes reasoning accuracy and confidence reliability, substantially improving calibration across mathematical, code, logic, and science benchmarks while maintaining or improving accuracy. The calibrated confidence signals also enable more efficient conformal risk control and effective confidence-weighted aggregation.
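
To make the idea of a confidence gap between correct and incorrect intermediate steps concrete, here is a minimal Python sketch. It is not the paper's reward: the summary does not give the formula, so the function names `confidence_margin` and `margin_augmented_reward`, the use of mean confidences, and the `weight` parameter are all assumptions chosen for illustration.

```python
from statistics import mean


def confidence_margin(confidences, correct):
    """Mean confidence on correct steps minus mean confidence on incorrect steps.

    confidences: per-step confidence values in [0, 1]
    correct:     per-step booleans (True if the step is judged correct)
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        # Assumption: with only one kind of step there is no gap to measure,
        # so this sketch returns 0.0 rather than raising.
        return 0.0
    return mean(right) - mean(wrong)


def margin_augmented_reward(outcome_reward, confidences, correct, weight=0.5):
    """Hypothetical combination: outcome reward plus a weighted margin bonus."""
    return outcome_reward + weight * confidence_margin(confidences, correct)


# One trajectory with two correct steps and one incorrect step.
margin = confidence_margin([0.9, 0.8, 0.3], [True, True, False])
reward = margin_augmented_reward(1.0, [0.9, 0.8, 0.3], [True, True, False])
```

Under these assumptions, a trajectory that is confident on its correct steps and unsure on its incorrect ones earns a larger bonus than one that is uniformly confident, which is the behavior the summary describes.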

📝 Abstract
Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
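
As an illustration of confidence-weighted aggregation, the Python sketch below sums the confidence attached to each sampled answer and returns the answer with the largest total. This is a generic weighted-voting scheme, not necessarily the aggregation rule used in the paper; the function name `confidence_weighted_vote` and the `(answer, confidence)` input format are assumptions.

```python
from collections import defaultdict


def confidence_weighted_vote(samples):
    """Pick the answer whose sampled completions carry the most total confidence.

    samples: non-empty list of (answer, confidence) pairs, one per completion.
    """
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    # max() raises ValueError on an empty input, which this sketch leaves as-is.
    return max(totals, key=totals.get)


# Two low-confidence samples agree on "41", one high-confidence sample says "42".
samples = [("42", 0.9), ("41", 0.4), ("41", 0.4)]
winner = confidence_weighted_vote(samples)
```

In this example plain majority voting would select "41", while the weighted vote selects "42" (0.9 against 0.8). The weighting only helps when confidence tracks correctness, which is why calibration matters for this kind of aggregation.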
Problem

Research questions and friction points this paper is trying to address.

overconfidence
hallucination
calibration
confidence reliability
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration
reinforcement learning
process supervision
confidence margin
LLM reasoning