When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

📅 2026-05-27
🤖 AI Summary
This work investigates how reinforcement learning (RL) in cross-domain transfer suppresses exploratory reasoning primitives, which limits gains on hard mathematical problem solving. Using only constraint-satisfaction puzzles for supervised fine-tuning and RL post-training of a 7B language model, the study introduces a reasoning-primitive-level analytical framework that combines a 9-class span classifier with motif extraction, and uses it to show that vanilla GSPO suppresses exploratory primitives such as hypothesize and backtrack. To mitigate this, the authors propose a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as the signal. Without any mathematics problems in the post-training data, the end-to-end recipe raises pass@32 on OlymMATH-Hard from 16.0% to 36.0%, a 20-percentage-point improvement over the OLMo3-7B-Instruct-SFT base.
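The pass@32 figures above depend on how pass@k is estimated. The source does not say which estimator the authors use, so the following is only a sketch of one common choice, the unbiased combinatorial estimator computed from n sampled rollouts of which c are correct. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and introduced here for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: rollouts sampled for one problem
    c: how many of those rollouts were correct
    k: the sampling budget being evaluated (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect rollouts: every size-k subset contains a correct one.
        return 1.0
    # 1 - P(all k drawn rollouts are incorrect), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("need at least one problem")
    return sum(scores) / len(scores)
```

With exactly 32 rollouts per problem, `pass_at_k(32, c, 32)` reduces to 1.0 if any rollout is correct and 0.0 otherwise, so the benchmark score is simply the fraction of problems solved at least once.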
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains, and why it does so, remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning-primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a $+7$pp \texttt{pass@32} gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further $+6$pp. However, this RL stage also suppresses exploratory primitives such as \textit{hypothesize} and \textit{backtrack}. To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further $+7$pp \texttt{pass@32} relative to vanilla GSPO. Overall, the end-to-end recipe raises the hard-math capability ceiling from $16.0\%$ at the OLMo3-7B-Instruct-SFT base to $36.0\%$ ($16.0 + 7 + 6 + 7$), without adding any mathematics problems during the SFT or RL stages.
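The abstract describes the novelty bonus only at a high level: correct rollouts are rewarded for diversity, with perplexity under the reference model as the signal. A minimal sketch of one way such a bonus could be shaped follows. The min-max normalization within a rollout group, the base reward of 1.0 for a correct answer, the weight `beta`, and the function names are all assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one rollout from its per-token log-probs under the reference model."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def shaped_rewards(
    ref_logprobs: Sequence[Sequence[float]],
    correct: Sequence[bool],
    beta: float = 0.1,  # assumed bonus weight, not taken from the paper
) -> List[float]:
    """Reward per rollout in one group: 1.0 if correct, plus a novelty bonus.

    Incorrect rollouts get 0.0 and no bonus, so novelty never pays for a wrong answer.
    Among correct rollouts, higher reference-model perplexity (less typical text)
    earns a larger bonus, scaled to [0, beta] by min-max normalization (an assumption).
    """
    if len(ref_logprobs) != len(correct):
        raise ValueError("one correctness flag per rollout is required")
    ppl = [perplexity(lp) for lp in ref_logprobs]
    correct_ppl = [p for p, ok in zip(ppl, correct) if ok]
    if not correct_ppl:
        return [0.0] * len(ppl)
    lo, hi = min(correct_ppl), max(correct_ppl)
    rewards = []
    for p, ok in zip(ppl, correct):
        if not ok:
            rewards.append(0.0)
            continue
        novelty = 0.0 if hi == lo else (p - lo) / (hi - lo)
        rewards.append(1.0 + beta * novelty)
    return rewards
```

Gating the bonus on correctness keeps the verifiable reward dominant: a rollout can only gain from being unusual under the reference model if it also reaches the right answer.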
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
reasoning diversity
cross-domain transfer
mathematical reasoning
vocabulary suppression
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning primitives
cross-domain transfer
novelty bonus
RLVR
chain-of-thought analysis