OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

๐Ÿ“… 2026-05-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF

career value

191K/year
๐Ÿค– AI Summary
Existing On-Policy Self-Distillation (OPSD) struggles to improve accuracy in mathematical reasoning tasks and can even degrade performance. This work reveals that OPSD fundamentally functions as a response compression mechanism rather than an error-correction one: applying it solely on correct reasoning trajectories significantly shortens output length while preserving accuracy, whereas training on incorrect trajectories harms performance. To address this, we reposition OPSD as a post-reinforcement learning compression tool and introduce a novel training pipelineโ€”SFT โ†’ RLVR โ†’ OPSDโ€”that integrates supervised fine-tuning, reinforcement learning with verifiable rewards (RLVR), and token-level credit assignment conditioned on privileged context. This approach enhances response conciseness without compromising reasoning accuracy.
๐Ÿ“ Abstract
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
Problem

Research questions and friction points this paper is trying to address.

On-Policy Self-Distillation
Reinforcement Learning with Verifiable Rewards
mathematical reasoning
reasoning models
post-training compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Self-Distillation
Reinforcement Learning with Verifiable Rewards
Mathematical Reasoning
Model Compression
Post-Training Pipeline
๐Ÿ”Ž Similar Papers