OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

On-Policy Self-Distillation (OPSD) struggles to improve accuracy in thinking-enabled mathematical reasoning and can even degrade it. This work shows that, in this setting, OPSD acts as a response-compression mechanism rather than an error-correction one: applying it only to correct reasoning trajectories substantially shortens outputs while preserving accuracy, whereas training only on incorrect trajectories harms accuracy. The paper therefore repositions OPSD as a post-reinforcement-learning compression stage and proposes the pipeline SFT → RLVR → OPSD: supervised fine-tuning, then reinforcement learning with verifiable rewards (RLVR), then OPSD, whose token-level credit assignment comes from a self-teacher conditioned on privileged context. The intended outcome is more concise responses without loss of reasoning accuracy.
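To make "token-level credit assignment from a self-teacher" concrete, here is a minimal sketch of one way such a signal could be computed. It assumes a per-token KL divergence between a teacher distribution (the same model, additionally conditioned on privileged context) and a student distribution (conditioned on the prompt alone). The summary does not state the objective used in the paper, so the KL direction, the function names, and the toy logits are all assumptions for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(teacher_logits: List[List[float]],
                 student_logits: List[List[float]]) -> List[float]:
    """Return KL(teacher || student) at each token position.

    Assumption: both inputs hold one row of logits per token of the same
    student-sampled response; the teacher row was computed with privileged
    context in the prompt, the student row without it.
    """
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        kls.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return kls


# Toy example: 2 token positions, vocabulary of size 3 (made-up numbers).
teacher = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
student = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.1]]
token_signal = per_token_kl(teacher, student)
```

In this toy input the first position has identical teacher and student logits, so its divergence is zero, while the second position disagrees and receives a positive value. A large per-token value marks a position where the hindsight-conditioned teacher would have behaved differently from the student.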

📝 Abstract

On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
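The experimental design described above, applying OPSD separately to correct and incorrect rollouts, can be sketched as a simple partition step. This is an illustrative outline only: the `Rollout` class, the `Answer:` marker convention, and the exact-match check are hypothetical stand-ins for whatever verifier the paper uses, and the OPSD training step itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    prompt: str
    response: str
    reference_answer: str


def extract_final_answer(response: str) -> str:
    """Hypothetical parser: return the text after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = response.rfind(marker)
    return response[idx + len(marker):].strip() if idx != -1 else ""


def is_correct(rollout: Rollout) -> bool:
    """Hypothetical verifier: exact string match against the reference answer."""
    return extract_final_answer(rollout.response) == rollout.reference_answer.strip()


def split_by_correctness(rollouts: List[Rollout]) -> Tuple[List[Rollout], List[Rollout]]:
    """Partition rollouts so each group can be used for OPSD in isolation."""
    correct: List[Rollout] = []
    incorrect: List[Rollout] = []
    for r in rollouts:
        (correct if is_correct(r) else incorrect).append(r)
    return correct, incorrect


def mean_length(rollouts: List[Rollout]) -> float:
    """Average response length in whitespace-separated tokens (0.0 if empty)."""
    if not rollouts:
        return 0.0
    return sum(len(r.response.split()) for r in rollouts) / len(rollouts)


# Toy rollouts (made-up data, not from the paper).
rollouts = [
    Rollout("2+2?", "Let me think. 2+2 is 4. Answer: 4", "4"),
    Rollout("2+2?", "Hmm, maybe 5. Answer: 5", "4"),
    Rollout("3*3?", "Answer: 9", "9"),
]
correct_group, incorrect_group = split_by_correctness(rollouts)
```

Under the pipeline the abstract proposes, only `correct_group` would be passed to the OPSD stage after RLVR, and a statistic such as `mean_length` could be tracked alongside accuracy to check that responses shorten without accuracy loss.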

Problem

Research questions and friction points this paper is trying to address.

On-Policy Self-Distillation

Reinforcement Learning with Verifiable Rewards

Mathematical Reasoning

Reasoning Models

Post-Training Compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Self-Distillation

Reinforcement Learning with Verifiable Rewards

Mathematical Reasoning