Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods apply a uniform advantage signal to all tokens, ignoring the uncertainty inherent in high-risk decisions during reasoning, which leads to insufficient exploration and entropy collapse. To address this, the authors propose an uncertainty-aware advantage shaping framework that incorporates both response-level model confidence and token-level logit determinacy into the advantage function. The method introduces a two-stage dynamic regulation mechanism: response-level confidence modulation and token-level determinacy penalization. Implemented within a model-free RL framework, it enables fine-grained credit assignment, substantially improving exploration efficiency and reasoning diversity. Experiments demonstrate consistent superiority over state-of-the-art RLVR baselines across five mathematical reasoning benchmarks, with robust performance across LLMs ranging from 1.5B to 7B parameters. The approach effectively mitigates entropy collapse and yields significant gains in final reward performance.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model's overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
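The abstract describes a two-stage shaping rule: scale the response-level (GRPO-style) advantage by the model's overall self-confidence, then subtract a token-level penalty based on each token's certainty. The paper's exact formulas are not given here, so the following is only a minimal sketch under assumed forms: `shaped_advantages` is a hypothetical helper, mean token probability stands in for response-level self-confidence, and token probability stands in for logit determinacy.

```python
import math

def shaped_advantages(base_adv, token_logprobs, correct,
                      conf_weight=0.5, penalty_weight=0.1):
    """Hypothetical sketch of UCAS-style advantage shaping (assumed forms).

    base_adv: scalar group-relative advantage for the response, as in GRPO.
    token_logprobs: log-probabilities of the sampled tokens in the response.
    correct: whether the verifiable reward marked the answer correct.
    Returns one shaped advantage per token.
    """
    probs = [math.exp(lp) for lp in token_logprobs]

    # Stage 1: response-level confidence modulation.
    # Mean token probability as a crude proxy for self-confidence.
    confidence = sum(probs) / len(probs)
    if correct:
        # Up-weight correct answers reached with LOW confidence,
        # rewarding successful exploration of uncertain paths.
        response_adv = base_adv * (1.0 + conf_weight * (1.0 - confidence))
    else:
        # Amplify the (negative) advantage of confidently wrong answers.
        response_adv = base_adv * (1.0 + conf_weight * confidence)

    # Stage 2: token-level determinacy penalty.
    # Nudge the policy away from near-deterministic tokens,
    # counteracting entropy collapse.
    return [response_adv - penalty_weight * p for p in probs]

# A correct but low-confidence response gets its advantage boosted above
# the raw group-relative value before the per-token penalty is applied.
adv = shaped_advantages(base_adv=1.0,
                        token_logprobs=[-0.1, -2.3, -0.5],
                        correct=True)
```

The intended effect of the asymmetry matches the abstract's claim: high-uncertainty correct paths are encouraged, while overconfident erroneous reasoning is penalized more heavily.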
Problem

Research questions and friction points this paper is trying to address.

Improves credit assignment in RLVR using uncertainty signals
Addresses inefficient exploration and entropy collapse in reasoning
Balances exploration-exploitation trade-off for mathematical reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages model's internal uncertainty for credit assignment
Modulates advantage using overall self-confidence and token certainty
Encourages exploration of high-uncertainty correct paths
Can Xie, Kuaishou Technology
Ruotong Pan, Kuaishou Technology
Xiangyu Wu, Kuaishou Technology
Yunfei Zhang, Kuaishou Technology
Jiayi Fu, Nankai University
Tingting Gao, Kuaishou Technology
Guorui Zhou, Unknown affiliation