ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

πŸ“… 2026-05-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes an adaptive distillation framework to address the high computational cost and excessive supervision on low-value trajectory segments inherent in existing on-policy distillation methods, which typically rely on full-trajectory updates. The approach dynamically adjusts the training window length by continuously evaluating the consistency between the student policy’s prefix and the teacher policy during online execution. It further incorporates a delayed full-trajectory probing mechanism alongside a staleness control strategy to select efficient and accurate supervision signals. Empirical results on mathematical and code reasoning tasks demonstrate that the method reduces training costs by up to 4.1Γ— while maintaining or even improving model accuracy compared to conventional approaches.
πŸ“ Abstract
On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix--full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy--compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
supervision horizon
rollout efficiency
adaptive training
teacher-student alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
adaptive window
horizon-aware
trajectory prefix
supervision efficiency
πŸ”Ž Similar Papers
2024-07-21arXiv.orgCitations: 1