🤖 AI Summary
Large language models often exhibit overconfidence in decision-making tasks, producing unreliable confidence estimates that undermine trust in their outputs within downstream systems. Conventional reinforcement learning approaches fall short here because decision tokens carry no explicit confidence information. To address this, the work proposes a calibration-aware reinforcement learning method that, for the first time, integrates calibration objectives directly into the reinforcement learning loss function. By optimizing the probability distribution of decision tokens, the method jointly improves accuracy and confidence reliability. Experimental results show that it maintains accuracy comparable to standard RLVR while significantly mitigating overconfidence, reducing Expected Calibration Error (ECE) by up to 9 points.
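The headline metric here, Expected Calibration Error (ECE), can be computed with the standard binned definition: partition predictions by confidence, then average the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses `n_bins=10` as an illustrative default, not necessarily the paper's setting:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # empirical accuracy in this bin
            conf = confidences[mask].mean()   # average stated confidence
            ece += mask.mean() * abs(acc - conf)
    return ece
```

An always-certain model that is right only half the time (the overconfidence failure mode described above) scores ECE = 0.5, while a model whose confidence matches its accuracy scores 0.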
📝 Abstract
Large language models (LLMs) are increasingly deployed in decision-making tasks, where not only accuracy but also reliable confidence estimates are essential. Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. In this work, we conduct a systematic study of calibration in two widely used fine-tuning paradigms: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We show that while RLVR improves task performance, it produces extremely overconfident models, whereas SFT yields substantially better calibration, even under distribution shift, though with smaller performance gains. Through targeted experiments, we diagnose RLVR's failure, showing that decision tokens merely extract the decision already formed in the reasoning trace and carry no confidence information, which prevents reinforcement learning from surfacing calibrated alternatives. Based on this insight, we propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities. Our method preserves RLVR's accuracy level while mitigating overconfidence, reducing ECE scores by up to 9 points.
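As a rough illustration of what "directly adjusting decision-token probabilities" could look like, the sketch below combines a REINFORCE-style term on the verified reward with a Brier-style penalty that pulls the decision-token probability toward the observed correctness. The penalty form, the `lam` weight, and all names are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def calibration_aware_loss(logits, decision_ids, is_correct, lam=0.5):
    """Hypothetical sketch: policy-gradient term plus a calibration penalty
    that aligns decision-token probability with verified correctness."""
    logits = np.asarray(logits, dtype=float)
    # Softmax over the vocabulary dimension (numerically stabilized).
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Probability assigned to each example's decision token.
    p_dec = probs[np.arange(len(logits)), decision_ids]
    reward = np.asarray(is_correct, dtype=float)  # 1 if verifier accepts, else 0
    pg_loss = -(reward * np.log(p_dec + 1e-9)).mean()   # REINFORCE-style term
    calib = ((p_dec - reward) ** 2).mean()              # Brier-style calibration term
    return pg_loss + lam * calib
```

Under this toy objective, a confidently wrong decision token incurs a larger loss than a hedged one, which is the direction of pressure the abstract describes.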