Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

📅 2026-05-08

🤖 AI Summary
This work addresses the inefficiency in long-horizon reasoning tasks, where dense rewards lose local validity once the student model's prefix deviates from the teacher's reasoning trajectory, leading to wasted computation and suboptimal training. To mitigate this, the authors propose Prune-OPD, a framework that dynamically detects prefix drift by monitoring real-time top-k overlap between student and teacher predictions. Upon significant drift, it monotonically attenuates the weights of subsequent unreliable rewards and triggers dynamic trajectory truncation, reallocating computational resources to supervision signals with higher compatibility. By integrating a local compatibility-aware dynamic pruning mechanism, Prune-OPD reduces training time by 37.6%–68.0% across benchmarks including AMC, AIME, and HMMT, while maintaining or even improving inference performance.
📝 Abstract
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%–68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
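The drift-detection loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact compatibility score, decay rule, or truncation criterion, so the function names (`topk_overlap`, `prune_weights`) and the parameters `drift_threshold`, `decay`, and `min_weight` are all hypothetical choices made for illustration. The sketch assumes per-token logits from student and teacher are available, measures top-k overlap at each position, shrinks a running reward weight whenever overlap falls below a threshold (so weights never increase along the rollout), and truncates once the weight drops below a floor.

```python
from typing import List, Sequence, Set, Tuple


def topk_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def topk_overlap(student_logits: Sequence[float],
                 teacher_logits: Sequence[float],
                 k: int) -> float:
    """Fraction of the teacher's top-k tokens that also appear in the student's top-k."""
    shared = topk_ids(student_logits, k) & topk_ids(teacher_logits, k)
    return len(shared) / k


def prune_weights(overlaps: Sequence[float],
                  drift_threshold: float = 0.5,  # assumed value, not from the paper
                  decay: float = 0.5,            # assumed value, not from the paper
                  min_weight: float = 0.2        # assumed value, not from the paper
                  ) -> Tuple[List[float], int]:
    """Map per-token overlaps to non-increasing reward weights and a truncation index.

    Each low-overlap position multiplies the running weight by `decay`.
    When the weight falls below `min_weight`, the rollout is cut at that
    position and later tokens receive no weight at all.
    """
    weights: List[float] = []
    w = 1.0
    for t, overlap in enumerate(overlaps):
        if overlap < drift_threshold:
            w *= decay  # drift event: weights can only go down from here
        if w < min_weight:
            return weights, t  # truncate: tokens t, t+1, ... are dropped
        weights.append(w)
    return weights, len(overlaps)
```

In a training loop, the returned weights would scale the per-token teacher reward, and the truncation index would signal when generation can stop early. A rollout whose overlap stays high never triggers decay, so it keeps full weight over its whole length, which matches the abstract's point that compatible trajectories retain long-context supervision.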
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
long-horizon reasoning
prefix drift
dense rewards
computational waste
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prune-OPD
on-policy distillation
prefix drift
dynamic rollout truncation
reward reliability