🤖 AI Summary
In mathematical reasoning, often only final answers are reliably verifiable, while intermediate reasoning steps lack supervision; as a result, Reinforcement Learning with Verifiable Rewards (RLVR) scales poorly, and token-level supervised fine-tuning tends to degenerate into rote memorization. Method: This paper proposes MR-RLVR, an RLVR framework that integrates "masked-then-fill" and reasoning-step reordering to construct process-level self-supervised reward signals, reducing reliance on final-answer verification and mitigating chain-of-thought degradation. The method comprises two stages: (1) self-supervised training on sampled mathematical calculation and proof data via masked-then-fill and step reordering, and (2) RLVR fine-tuning on calculation datasets where only outcomes are verifiable, guided by result-verifiable rewards. Contribution/Results: On benchmarks including AIME24 and AIME25, MR-RLVR achieves an average relative gain of +9.86% in Pass@1 over vanilla RLVR, enhancing long-horizon reasoning capability in outcome-only-verifiable settings.
📝 Abstract
Test-time scaling has been shown to substantially improve the mathematical reasoning of large language models (LLMs). However, for a large portion of mathematical corpora, especially theorem proving, the scalability of Reinforcement Learning with Verifiable Rewards (RLVR) is limited: intermediate reasoning is crucial, while final answers are difficult to verify directly and reliably. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in settings where only outcomes are verifiable.
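The two process-level self-supervised tasks, masked-then-fill and step reordering, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the single-step masking, and the binary exact-match reward are placeholders for exposition, not the paper's implementation.

```python
import random

# Hedged sketch: construct self-supervised examples from a list of
# chain-of-thought steps. The paper's actual masking granularity and
# reward shaping may differ.

def make_masked_fill_example(steps, mask_token="<MASK>", seed=0):
    """Hide one intermediate step; the model must reconstruct it."""
    rng = random.Random(seed)
    idx = rng.randrange(len(steps))
    masked = list(steps)
    target = masked[idx]
    masked[idx] = mask_token
    return masked, target

def make_reorder_example(steps, seed=0):
    """Shuffle the steps; the model must recover the original order.
    order[k] is the original index of shuffled[k]."""
    rng = random.Random(seed)
    order = list(range(len(steps)))
    rng.shuffle(order)
    shuffled = [steps[i] for i in order]
    return shuffled, order

def process_reward(prediction, target):
    """Binary process-level reward: 1 for an exact reconstruction, else 0."""
    return 1.0 if prediction == target else 0.0
```

Because both tasks derive their targets from the corpus itself, they supply reward signal on intermediate steps even when no final-answer verifier exists, which is the gap MR-RLVR targets.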