Process Supervision of Confidence Margin for Calibrated LLM Reasoning

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Outcome-oriented reward mechanisms often make large language models overconfident during reasoning, which leads to hallucinations, miscalibrated confidence, and inefficient use of compute. This work proposes the RLCM framework, which introduces process supervision within individual reasoning trajectories by adding a confidence-margin term to the reward. Rather than directly aligning model confidence with the likelihood of correctness, RLCM widens the confidence gap between correct and incorrect intermediate steps. The approach jointly optimizes reasoning accuracy and confidence reliability, substantially improving calibration across mathematical, code, logical, and scientific benchmarks while maintaining or improving accuracy. The calibrated confidence signals also enable more efficient conformal risk control and more effective confidence-weighted aggregation.
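To make the "confidence gap" concrete, here is a minimal sketch of one way such a gap could be measured for a single trajectory. It is not the reward used by RLCM, which this summary does not specify; the function name `confidence_margin`, the per-step confidence list, and the per-step correctness labels are all assumptions made for illustration.

```python
from statistics import mean

def confidence_margin(confidences, correct):
    """Mean confidence on correct steps minus mean confidence on incorrect steps.

    Hypothetical helper for illustration only (not the paper's reward).
    Returns 0.0 when the trajectory has no correct or no incorrect steps,
    because no gap can be measured in that case.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return 0.0
    return mean(right) - mean(wrong)

# Three intermediate steps: two judged correct, one judged incorrect.
margin = confidence_margin([0.9, 0.8, 0.3], [True, True, False])
```

A larger value means the model is more confident on steps that turned out correct than on steps that did not, which is the direction the summary says RLCM pushes toward.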

📝 Abstract

Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
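Confidence-weighted aggregation can be illustrated with a generic weighted vote over sampled answers. The sketch below is a standard formulation and not necessarily the aggregation rule used in the paper; the function name `weighted_vote` and the example inputs are hypothetical.

```python
from collections import defaultdict

def weighted_vote(answers, confidences):
    """Return the answer whose samples have the largest total confidence.

    Generic confidence-weighted voting, shown for illustration only.
    Raises ValueError if `answers` is empty.
    """
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)

# "B" wins a plain majority vote (2 of 3 samples), but "A" carries
# more total confidence (0.9 versus 0.3 + 0.4 = 0.7).
best = weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.4])
```

This kind of vote only helps when confidence tracks correctness, which is why calibration matters for the aggregation result the abstract reports.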

Problem

Research questions and friction points this paper is trying to address.

overconfidence

hallucination

calibration

confidence reliability

reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration

reinforcement learning

process supervision