Spurious Rewards: Rethinking Training Signals in RLVR

📅 2025-06-12
📈 Citations: 8
Influential: 1
🤖 AI Summary
Reinforcement learning with verifiable rewards (RLVR) surprisingly improves mathematical reasoning performance even under spurious reward signals (e.g., random, format-based, or incorrect labels), yet the underlying mechanism remains unclear. Method: We analyze RLVR's effect on Qwen2.5-Math and contrast it with Llama3 and OLMo2, testing spurious rewards including one-shot RL and majority-voting pseudo-rewards on MATH-500. Contribution/Results: We find that RLVR surfaces latent reasoning representations acquired during Qwen2.5-Math's pretraining, without requiring ground-truth rewards, chiefly by amplifying "code-style reasoning" (thinking in code without executing it), whose frequency rises from 65% to over 90%. This effect is model-specific and absent in the other model families. RLVR with majority-voting pseudo-rewards yields a 27.1% absolute accuracy gain on MATH-500, approaching the 29.1% gain attained with ground-truth rewards. The work suggests that RLVR can improve reasoning via model-specific amplification of pretrained reasoning patterns, and cautions that RLVR findings should be validated across diverse model families.

📝 Abstract
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) -- nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
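The spurious reward settings the abstract enumerates (random reward, format reward, incorrect label) can be sketched as simple reward functions. This is an illustrative sketch, not the authors' code: the exact reward probabilities and the format criterion (presence of a `\boxed{}` answer) are assumptions for the example.

```python
import random

def random_reward(response: str) -> float:
    """Spurious reward: 1 with fixed probability, independent of the response."""
    return 1.0 if random.random() < 0.5 else 0.0

def format_reward(response: str) -> float:
    """Spurious reward: pays for formatting only (assumed criterion:
    the response contains a boxed answer), not for correctness."""
    return 1.0 if "\\boxed{" in response else 0.0

def incorrect_label_reward(response: str, wrong_answer: str) -> float:
    """Spurious reward: pays for matching a deliberately incorrect label,
    i.e. it is negatively correlated with the correct answer."""
    return 1.0 if wrong_answer in response else 0.0
```

None of these signals carry information about the correct answer, which is what makes the reported MATH-500 gains on Qwen2.5-Math-7B surprising.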
Problem

Research questions and friction points this paper is trying to address.

Investigates the impact of spurious rewards on RLVR performance
Examines why reward effects vary across model families
Probes RLVR's role in surfacing pretrained reasoning representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR improves reasoning with spurious rewards
Code reasoning increases significantly post-RLVR
Diverse model validation needed for RLVR