The Invisible Leash: Why RLVR May Not Escape Its Origin

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands the capability frontier of reasoning models or merely reweights high-reward outputs already present in the base model's output distribution. Method: theoretical analysis combined with large-scale token-level and answer-level sampling experiments that characterize RLVR's behavior under verifiable reward signals. Contribution/Results: the paper formally establishes that RLVR is constrained by the support of the base model: it cannot generate solutions that have zero initial probability and can only redistribute probability mass within the existing support. It also identifies a pronounced entropy-reward trade-off: increased answer concentration comes at the cost of reduced generation diversity. Although pass@1 improves, the effective solution space contracts, systematically excluding some correct answers that were originally accessible to the base model. The work offers the first formal characterization of RLVR's support preservation and inherent conservatism, and empirical evidence of its limits in discovering genuinely novel solutions.
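To make the support claim concrete, here is a minimal sketch of the usual argument under a KL-regularized RLVR objective (an assumption made here for illustration; the paper's own formalization may differ):

```latex
% Sketch: support preservation under a KL-regularized objective (assumed formulation).
% Suppose the fine-tuned policy solves
%   \max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right),
% whose well-known closed form is
\[
\pi^{*}(y \mid x) = \frac{\pi_0(y \mid x)\, \exp\!\left(r(x, y)/\beta\right)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \pi_0(y' \mid x)\, \exp\!\left(r(x, y')/\beta\right).
\]
% Because \exp(\cdot) is strictly positive and finite,
%   \pi_0(y \mid x) = 0 \;\Rightarrow\; \pi^{*}(y \mid x) = 0,
% so training only reweights probability mass inside \mathrm{supp}\,\pi_0(\cdot \mid x)
% and cannot assign mass to solutions the base model never generates.
```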

📝 Abstract
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support (it is unable to sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
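To make the token-level versus answer-level entropy distinction concrete, here is a small illustrative sketch (not the paper's code; the function names and toy samples are hypothetical). Token-level entropy averages the per-step uncertainty of the next-token distribution, while answer-level entropy is computed over the empirical distribution of distinct final answers, so many distinct generation paths can still collapse onto a few answers.

```python
import math
from collections import Counter

def mean_token_entropy(step_distributions):
    """Average per-step entropy (nats) over a generation.

    `step_distributions` is a list of next-token probability vectors,
    one per decoding step (hypothetical inputs for illustration).
    """
    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return sum(entropy(p) for p in step_distributions) / len(step_distributions)

def answer_entropy(sampled_answers):
    """Entropy (nats) of the empirical distribution over distinct final answers."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy illustration: flatter per-step distributions give higher token-level entropy,
# yet the sampled final answers can still concentrate on a small set.
steps = [[0.5, 0.25, 0.25], [0.4, 0.3, 0.3]]
print(mean_token_entropy(steps))

answers_base = ["12", "15", "12", "9", "15", "21", "12", "9"]    # more spread out
answers_rlvr = ["12", "12", "12", "12", "12", "12", "12", "15"]  # concentrated
print(answer_entropy(answers_base), answer_entropy(answers_rlvr))
```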
Problem

Research questions and friction points this paper is trying to address.

RLVR may not expand reasoning boundaries; it may only amplify outputs the base model already knows (a pass@k comparison sketch follows this list)
RLVR is constrained by the base model's support, limiting genuinely original solutions
RLVR reduces answer-level diversity despite sometimes increasing token-level entropy
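One way to probe the first point empirically is to compare base and RLVR-trained models under large sampling budgets with the standard unbiased pass@k estimator (Chen et al., 2021). A minimal sketch follows; the sample counts are hypothetical and only illustrate the pattern the paper describes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k draws (without replacement) from the n samples is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts illustrating the claimed pattern: the RLVR model wins at
# small k (precision), while the base model catches up under large budgets
# because its empirical support is broader.
print(pass_at_k(n=256, c=40, k=1))    # e.g. base model, pass@1
print(pass_at_k(n=256, c=120, k=1))   # e.g. RLVR model, pass@1
print(pass_at_k(n=256, c=40, k=128))  # large-budget comparison for the base model
```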
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theoretical result: RLVR is constrained by the base model's support and acts as a conservative reweighting mechanism
Entropy-reward tradeoff: gains in precision progressively narrow exploration
Suggests explicit exploration or hybrid strategies that seed probability mass into underrepresented solution regions (a toy mixture-policy sketch follows this list)
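As a purely hypothetical illustration of the last point (not a method proposed in the paper), a hybrid strategy could mix the RLVR policy with the base model so that some probability mass stays on the base model's broader support:

```python
import random

def hybrid_sample(prompt, rlvr_generate, base_generate, eps=0.1, rng=random):
    """Sample from a mixture policy: (1 - eps) * RLVR policy + eps * base policy.

    `rlvr_generate` and `base_generate` are hypothetical callables that map a
    prompt to one sampled completion; `eps` controls how much probability mass
    is reserved for the base model's broader support.
    """
    if rng.random() < eps:
        return base_generate(prompt)
    return rlvr_generate(prompt)
```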