Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how spurious rewards in Reinforcement Learning with Verifiable Rewards (RLVR) exploit the tendency of large language models to rely on memorization rather than genuine reasoning. Through mechanistic analysis, the study reveals that RLVR activates a "memory shortcut" within the model: under spurious rewards, answer-token perplexity drops while prompt-side coherence degrades. Leveraging interpretability techniques, including Path Patching, Logit Lens, and neural differential equations, the authors identify, for the first time, a causal circuit composed of Functional Anchors in intermediate layers (L18-20) and Structural Adapters in subsequent layers (L21+). By selectively modulating specific MLP pathways within this circuit, they demonstrate bidirectional intervention over the memory-contamination effect, offering a mechanistic route to mitigating data contamination in RLVR training.
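The diagnostic signal described above, falling answer-token perplexity alongside degrading prompt coherence, can be sketched as a simple per-span perplexity computation. This is a minimal illustration assuming per-token log-probabilities are already available from a language model; the function names and span-splitting convention are illustrative, not taken from the paper's code.

```python
import math

def perplexity(logprobs):
    """Perplexity of a token span: exp of the mean negative log-likelihood."""
    return math.exp(-sum(logprobs) / len(logprobs))

def perplexity_divergence(token_logprobs, answer_start):
    """Split a sequence's per-token log-probs into prompt and answer spans.

    Returns (prompt_ppl, answer_ppl). Rising prompt perplexity paired with
    falling answer perplexity is the divergence the paper calls the
    "Perplexity Paradox": the model retrieves the answer without modeling
    the prompt coherently.
    """
    prompt_ppl = perplexity(token_logprobs[:answer_start])
    answer_ppl = perplexity(token_logprobs[answer_start:])
    return prompt_ppl, answer_ppl
```

For example, a sequence whose prompt tokens each have probability 1/2 but whose answer tokens are predicted with near-certainty would yield a prompt perplexity near 2 and an answer perplexity near 1.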

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering, artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
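The intervention the abstract describes, scaling specific MLP keys, can be sketched under the common key-value view of transformer MLPs, in which each row of the up-projection acts as a "key" that gates a corresponding "value" row of the down-projection. The sketch below is an assumed, simplified version of that idea; the names (`W_up`, `key_indices`, `alpha`) are illustrative and do not correspond to the paper's actual implementation.

```python
import numpy as np

def scale_mlp_keys(W_up, key_indices, alpha):
    """Return a copy of an MLP up-projection with selected key rows scaled.

    Scaling a key row by alpha scales its pre-activation for every input,
    so alpha > 1 amplifies that key's contribution downstream and
    alpha < 1 (down to 0) suppresses it, giving the bidirectional
    steering knob the abstract refers to.
    """
    W = np.array(W_up, dtype=float, copy=True)  # leave the original weights intact
    W[key_indices] *= alpha
    return W
```

Setting `alpha = 0` for the keys implicated in the Anchor-Adapter circuit would correspond to suppressing the shortcut, while `alpha > 1` would correspond to amplifying it.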
Problem

Research questions and friction points this paper is trying to address.

Spurious Rewards
Memorization Shortcuts
Reinforcement Learning with Verifiable Rewards
Data Contamination
Reasoning Bypass
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spurious Rewards Paradox
Memorization Shortcuts
Anchor-Adapter Circuit
Reinforcement Learning with Verifiable Rewards
Mechanistic Interpretability