Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the prevailing view that the exploration-exploitation trade-off in Reinforcement Learning with Verifiable Rewards (RLVR) constitutes a fundamental limitation, arguing instead that it may arise as an artifact of suboptimal observation-level representation. By explicitly modeling a latent semantic state space, we formally prove, for the first time, that exploration and exploitation can be decoupled within this underlying space. Building on this insight, we introduce Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA) as dynamic representational metrics that quantify temporal changes in latent rank structure. We further design an ERA-guided dual-channel meta-controller that proactively modulates the advantage function, enhancing exploration and exploitation synergistically. Evaluated across multiple large language models and reasoning benchmarks, our approach achieves substantial performance gains, up to +21.4% absolute accuracy on the challenging Gaokao 2024 dataset, offering both a novel theoretical perspective on the RLVR trade-off and a practical, scalable solution.

📝 Abstract
A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this view, proposing that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), suggesting an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to a 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
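The abstract does not spell out how ER, ERV, and ERA are computed. A common definition of effective rank (Roy & Vetterli, 2007) is the exponential of the Shannon entropy of the normalized singular-value spectrum, and velocity/acceleration can be read as first- and second-order finite differences of ER over training steps. The sketch below assumes those standard definitions, which may differ in detail from the paper's formulation:

```python
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """Effective rank of a hidden-state matrix H (tokens x dim):
    exp of the Shannon entropy of the normalized singular-value
    spectrum (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()                      # normalize spectrum to a distribution
    p = p[p > 0]                         # guard against log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

def er_velocity(er_trace):
    """ERV: first-order finite difference of ER over training steps."""
    return np.diff(er_trace)

def er_acceleration(er_trace):
    """ERA: second-order finite difference of ER over training steps."""
    return np.diff(er_trace, n=2)

# Toy check: hidden states confined to a low-rank subspace have lower ER
# than states spread across the full space.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 64))
full_rank = rng.normal(size=(128, 64))
assert effective_rank(low_rank) < effective_rank(full_rank)
```

Under this definition, a matrix with k equal singular values has an effective rank of exactly k, so ER interpolates smoothly between hard ranks, which is what makes its temporal derivatives meaningful signals.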
Problem

Research questions and friction points this paper is trying to address.

Re-examines the exploration-exploitation trade-off in RLVR as a measurement artifact
Proposes hidden-state analysis with Effective Rank derivatives to decouple exploration and exploitation
Introduces VERL method to synergistically enhance both exploration and exploitation capacities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifting analysis to hidden-state space using Effective Rank
Introducing Effective Rank Velocity and Acceleration derivatives
Leveraging ERA as predictive meta-controller for dual incentives
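The abstract describes ERA acting as a meta-controller that shapes the RL advantage through two channels, amplifying exploration when the model risks overconfidence and reinforcing exploitative gains otherwise. The paper's actual shaping rule is not given here; the function below is a purely illustrative sketch of that dual-channel idea, with hypothetical coefficients `alpha` and `beta`:

```python
import numpy as np

def shaped_advantage(advantage, era, alpha=0.1, beta=0.1):
    """Illustrative dual-channel advantage shaping keyed on ERA.

    NOTE: names, branching rule, and coefficients are assumptions,
    not the paper's formulation. The idea: a negative ERA (exploration
    decelerating) triggers the exploration channel to preempt premature
    overconfidence; a non-negative ERA triggers the exploitation channel
    to consolidate reasoning gains.
    """
    advantage = np.asarray(advantage, dtype=float)
    if era < 0:
        scale = 1.0 + alpha * min(-era, 1.0)   # exploration channel
    else:
        scale = 1.0 + beta * min(era, 1.0)     # exploitation channel
    return advantage * scale
```

In a GRPO- or PPO-style loop, such a scale would multiply the per-group advantages before the policy-gradient update; because ERA is computed from hidden states rather than token statistics, the controller can react before token-level entropy collapses.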
Fanding Huang
Tsinghua University
Semantic Segmentation, Test-time Adaptation, Large Language Models
Guanbo Huang
UESTC
multi-modal learning
Xiao Fan
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yi He
Tsinghua Shenzhen International Graduate School, Tsinghua University
Xiao Liang
University of California, Los Angeles
Xiao Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University
Qinting Jiang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Faisal Nadeem Khan
Associate Professor, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
Machine Learning Techniques, Digital Signal Processing, Fiber-optic Communication Networks
Jingyan Jiang
Shenzhen Technology University
Test-time Adaptation, Embodied AI, Machine Learning Systems
Zhi Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University