🤖 AI Summary
This work addresses a limitation of large language models in multi-step reasoning: they tend to commit to an answer prematurely and spend the remaining tokens rationalizing it, which produces logical gaps and unjustified leaps. The authors propose progressive confidence shaping, a reinforcement learning objective that uses the evolution of the model's confidence during reasoning as a training signal, rewarding gradually increasing confidence and penalizing early commitment. The approach requires neither external annotations nor process reward models, so it can be applied across reasoning tasks. Experiments on models from 1.5B to 8B parameters show gains on arithmetic, math, and science tasks: on Countdown, accuracy improves by 42.0 percentage points (a 3.2× improvement) and flawed reasoning drops by 48 pp; on AIME, Pass@64 rises by 6.6 pp. On a separate safety benchmark, the trained models also surface misleading content in their reasoning traces more transparently, which indicates improved faithfulness.
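As a reading aid, the two Countdown figures can be combined to estimate the absolute accuracies they imply. This is a back-of-the-envelope derivation, not a number reported in the source: it assumes that "3.2×" is the ratio of final to baseline accuracy, and the symbols $a_0$ (baseline accuracy) and $a_1$ (accuracy after training) are introduced here for illustration only.

```latex
\begin{align*}
a_1 &= 3.2\,a_0, \qquad a_1 - a_0 = 42.0\ \text{pp} \\
\Rightarrow\quad 2.2\,a_0 &= 42.0\ \text{pp} \\
\Rightarrow\quad a_0 &\approx 19.1\%, \qquad a_1 \approx 61.1\%
\end{align*}
```

Because 3.2× is a rounded figure, these values are approximate.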
📝 Abstract
Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find an annotation-free signal of reasoning quality in how the model's confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early: it rewards gradual confidence growth and penalizes early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2× (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.
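To make the idea of "rewarding gradual confidence growth and penalizing early commitment" concrete, here is a minimal Python sketch of one possible shaping reward. It is not the authors' objective, which the abstract does not specify. The function name, the parameters `early_fraction`, `early_threshold`, and `penalty_weight`, and the reward formula itself are all hypothetical. The sketch also assumes that confidence is available as a list of probabilities in [0, 1], for example the probability the model assigns to its eventual answer when probed at successive points in the reasoning trace.

```python
from typing import Sequence


def progressive_confidence_reward(
    confidences: Sequence[float],
    early_fraction: float = 0.25,   # hypothetical: share of the trace treated as "early"
    early_threshold: float = 0.8,   # hypothetical: confidence above this counts as commitment
    penalty_weight: float = 1.0,    # hypothetical: strength of the early-commitment penalty
) -> float:
    """Toy shaping reward over a confidence trajectory (illustrative only).

    The reward is the net confidence growth from the first to the last
    checkpoint, minus a penalty for confidence that exceeds
    `early_threshold` within the early part of the trace.
    """
    if len(confidences) < 2:
        raise ValueError("need at least two confidence checkpoints")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Net growth: positive when confidence builds up over the trace.
    growth = confidences[-1] - confidences[0]

    # Early commitment: how far the early checkpoints exceed the threshold.
    n_early = max(1, int(len(confidences) * early_fraction))
    early_excess = max(0.0, max(confidences[:n_early]) - early_threshold)

    return growth - penalty_weight * early_excess


# A trace whose confidence builds gradually versus one that commits at the start.
gradual = [0.1, 0.3, 0.5, 0.7, 0.9]
premature = [0.95, 0.95, 0.96, 0.97, 0.97]
print(progressive_confidence_reward(gradual))
print(progressive_confidence_reward(premature))
```

Under these assumptions the gradual trace scores higher than the premature one, which is the qualitative behavior the abstract describes. A real training setup would also need to combine such a term with a task reward, a detail the abstract leaves open.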