Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In mathematical reasoning, only final answers are verifiable, while intermediate reasoning steps lack supervision—this limits Reinforcement Learning from Verifiable Rewards (RLVR) and lets token-level SFT degenerate into rote memorization. Method: This paper proposes MR-RLVR, the first RLVR framework integrating masked language modeling and reasoning-step reordering to construct process-level self-supervised reward signals, thereby reducing reliance on final-answer verification and mitigating chain-of-thought degradation. The method comprises two stages: (1) self-supervised training via masked-then-fill and step reordering on sampled mathematical calculation and proof data, and (2) RLVR fine-tuning guided by outcome-verifiable rewards. Contribution/Results: On benchmarks including AIME24 and AIME25, MR-RLVR achieves an average relative gain of +9.86% Pass@1 over the original RLVR, substantially enhancing long-horizon reasoning in settings where only outcomes are verifiable.

📝 Abstract
Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in only outcome-verifiable settings.
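The "masked-then-fill" pretext task described in the abstract can be illustrated with a minimal data-construction sketch. The paper does not publish this helper; the function names, the exact-match reward, and the choice to never mask the first premise or the final step are assumptions for illustration only—a real system would likely score reconstructions with a softer similarity measure.

```python
import random

def make_masked_then_fill_example(steps, mask_rate=0.3, mask_token="<MASK>"):
    """Mask a fraction of intermediate reasoning steps; the model must
    reconstruct them. Hypothetical helper, not the paper's implementation."""
    n = len(steps)
    # Assumption: never mask the opening premise or the final answer step.
    candidates = list(range(1, n - 1))
    k = max(1, int(len(candidates) * mask_rate))
    masked_idx = set(random.sample(candidates, k))
    prompt = [mask_token if i in masked_idx else s for i, s in enumerate(steps)]
    targets = {i: steps[i] for i in sorted(masked_idx)}
    return prompt, targets

def fill_reward(predictions, targets):
    """Process-level reward: fraction of masked steps reconstructed exactly."""
    if not targets:
        return 0.0
    hits = sum(predictions.get(i, "") == t for i, t in targets.items())
    return hits / len(targets)
```

Because the reward is computed against the original chain of thought rather than a verified final answer, it supplies a learnable signal for intermediate steps even on proof-style data where outcomes are hard to check.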
Problem

Research questions and friction points this paper is trying to address.

Improving mathematical reasoning in LLMs when only final answers are verifiable
Addressing token-level SFT degeneration into rote memorization rather than reasoning
Enhancing RLVR scalability for mathematical corpora with unverifiable intermediate steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked-then-fill for self-supervised rewards
Applies step reordering to extract reasoning signals
Combines self-supervised pretraining with RL fine-tuning
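The step-reordering task listed above can be sketched in the same spirit: shuffle the steps of a solution and reward the model for recovering the original order. The pairwise (Kendall-style) partial-credit reward below is an assumption for illustration—the paper may score orderings differently.

```python
import random

def make_reordering_example(steps, seed=None):
    """Shuffle reasoning steps; the model must recover the original order.
    Hypothetical sketch of the 'step reordering' pretext task."""
    rng = random.Random(seed)
    order = list(range(len(steps)))
    rng.shuffle(order)
    shuffled = [steps[i] for i in order]
    # order[j] is the original index of the j-th shuffled step.
    return shuffled, order

def reorder_reward(predicted_order, true_order):
    """Fraction of step pairs placed in the correct relative order,
    giving partial credit for nearly correct permutations."""
    n = len(true_order)
    if n < 2:
        return 1.0
    pos_pred = {s: j for j, s in enumerate(predicted_order)}
    pos_true = {s: j for j, s in enumerate(true_order)}
    items = list(pos_true)
    concordant = total = 0
    for a in range(n):
        for b in range(a + 1, n):
            x, y = items[a], items[b]
            total += 1
            if (pos_pred[x] - pos_pred[y]) * (pos_true[x] - pos_true[y]) > 0:
                concordant += 1
    return concordant / total
```

A fully correct ordering scores 1.0 and a fully reversed one scores 0.0, so the signal is dense rather than all-or-nothing—useful when the policy gets the global structure right but swaps adjacent steps.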