🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), sparse outcome-based rewards provide no guidance for intermediate reasoning steps, which slows exploration. To address this, the authors propose the Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed from the model's evolving confidence in the ground-truth answer during inference. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the correct answer should follow a generally ascending trend. Embedding this bias constrains the exploration search space to regions richer in logically sound reasoning, improving both exploration efficiency and reasoning-path quality. Experiments show that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple mathematical and logical reasoning benchmarks, including GSM8K, MATH, and AQuA. These results support the effectiveness and generality of dense, model-intrinsic shaping signals for improving large language models' reasoning capabilities.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
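To make the shaping signal concrete, here is a minimal sketch of one natural formulation consistent with the abstract: score each reasoning step by the change in the model's (log-)probability of the ground-truth answer, so steps that raise confidence earn positive dense reward and the per-step rewards telescope to the overall confidence gain. The function name `pacr_rewards` and the log-difference form are illustrative assumptions, not the paper's exact definition; in practice `answer_probs[t]` would come from querying the model for the ground-truth answer's probability after the first `t` steps of the trajectory.

```python
import math


def pacr_rewards(answer_probs):
    """Dense per-step rewards from the model's evolving belief in the
    ground-truth answer (assumed log-difference formulation).

    answer_probs[t] is p(ground-truth answer | reasoning steps 0..t).
    Reward for step t is log p_t - log p_{t-1}: positive when the step
    raises confidence in the correct answer, negative when it lowers it.
    """
    rewards = []
    for prev, curr in zip(answer_probs, answer_probs[1:]):
        rewards.append(math.log(curr) - math.log(prev))
    return rewards


# Toy trajectory: confidence generally ascends, with one dip at step 2.
probs = [0.10, 0.25, 0.20, 0.60, 0.90]
print([round(r, 3) for r in pacr_rewards(probs)])
# → [0.916, -0.223, 1.099, 0.405]
```

Because the log-differences telescope, the rewards sum to `log p_final - log p_initial`, so the dense signal redistributes the trajectory-level confidence gain across individual steps rather than inventing extra reward mass.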