🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), sparse outcome-based rewards provide no guidance for intermediate reasoning steps, which slows exploration. To address this, the authors propose the Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed from the model's evolving confidence in the ground-truth answer during inference. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the correct answer should follow a generally ascending trend. Embedding this bias constrains the exploration search space to regions richer in logically sound reasoning, improving both exploration efficiency and reasoning-path quality. Experiments show that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple mathematical and logical reasoning benchmarks, including GSM8K, MATH, and AQuA. These results support the effectiveness and generality of dense, model-intrinsic shaping signals for improving large language models' reasoning capabilities.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
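To make the shaping signal concrete, here is a minimal sketch of one natural formulation consistent with the abstract: score each reasoning step by the change in the model's (log-)probability of the ground-truth answer, so steps that raise confidence earn positive dense reward and the per-step rewards telescope to the overall confidence gain. The function name `pacr_rewards` and the log-difference form are illustrative assumptions, not the paper's exact definition; in practice `answer_probs[t]` would come from querying the model for the ground-truth answer's probability after the first `t` steps of the trajectory.

```python
import math


def pacr_rewards(answer_probs):
    """Dense per-step rewards from the model's evolving belief in the
    ground-truth answer (assumed log-difference formulation).

    answer_probs[t] is p(ground-truth answer | reasoning steps 0..t).
    Reward for step t is log p_t - log p_{t-1}: positive when the step
    raises confidence in the correct answer, negative when it lowers it.
    """
    rewards = []
    for prev, curr in zip(answer_probs, answer_probs[1:]):
        rewards.append(math.log(curr) - math.log(prev))
    return rewards


# Toy trajectory: confidence generally ascends, with one dip at step 2.
probs = [0.10, 0.25, 0.20, 0.60, 0.90]
print([round(r, 3) for r in pacr_rewards(probs)])
# → [0.916, -0.223, 1.099, 0.405]
```

Because the log-differences telescope, the rewards sum to `log p_final - log p_initial`, so the dense signal redistributes the trajectory-level confidence gain across individual steps rather than inventing extra reward mass.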