Exploration vs. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

📅 2025-12-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In RLVR (reinforcement learning with verifiable rewards), a fundamental tension exists between exploration and exploitation: two seemingly contradictory mechanisms, spurious (misaligned) rewards and entropy minimization, both improve LLM mathematical reasoning performance. Method: The authors formalize a reward-misalignment model and conduct systematic analyses, including policy entropy profiling, clipping-bias modeling, and contamination-control experiments, to disentangle these effects. Contribution/Results: They demonstrate that spurious rewards do not rely on model contamination; instead, they act through clipping bias, which reduces policy entropy and makes outputs more deterministic. Crucially, entropy reduction alone is insufficient; it must be coupled with specific reward structures to yield gains. This work provides the first mechanistic explanation of how reward misalignment and entropy minimization synergize, yielding interpretable, principled guidelines for RLVR training. Empirically, these insights enable more stable and reproducible performance improvements on mathematical reasoning benchmarks.
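The clipping bias discussed above comes from the standard PPO-style clipped surrogate used in RLVR training. A minimal NumPy sketch (illustrative only, not the authors' code) shows the truncation: once the probability ratio leaves the clip interval in the direction favored by the advantage, the objective goes flat and the gradient for that token is cut off.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage A.
    Once ratio leaves [1 - eps, 1 + eps] in the direction favored by A,
    the objective is flat, so the update for that token is truncated.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio past 1 + eps no longer
# raises the objective -- the gain is capped at the clip edge.
print(clipped_surrogate(1.0, 1.0))  # 1.0 (inside the trust region)
print(clipped_surrogate(2.0, 1.0))  # 1.2 (capped at 1 + eps)
```

The asymmetry of this cap, applied under a reward signal that is uncorrelated with the ground truth, is what the paper identifies as the source of the systematic entropy reduction.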

📝 Abstract
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
Problem

Research questions and friction points this paper is trying to address.

Explores exploration-exploitation trade-off in RLVR for LLMs
Investigates how spurious rewards and entropy affect reasoning performance
Explains mechanisms behind spurious rewards improving model outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spurious rewards suppress exploitation by rewarding outcomes unrelated to ground truth
Clipping bias reduces policy entropy for confident outputs
Reward-misalignment model explains spurious reward performance gains
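Policy entropy, the quantity the clipping bias is said to drive down, can be made concrete with a small sketch (hypothetical, not from the paper): sharper logits yield a more deterministic softmax policy and hence lower Shannon entropy.

```python
import numpy as np

def policy_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    z = logits - np.max(logits)           # stabilize the exponentials
    p = np.exp(z) / np.sum(np.exp(z))
    return float(-np.sum(p * np.log(p)))

uniform = policy_entropy(np.zeros(4))                     # log(4), maximal
peaked = policy_entropy(np.array([8.0, 0.0, 0.0, 0.0]))   # near zero
print(uniform, peaked)
```

A run that lowers this quantity makes outputs more deterministic; the paper's point is that such entropy reduction helps only when paired with a suitable reward structure, not on its own.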