ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

career value

146K/year

🤖 AI Summary

This work proposes an adaptive distillation framework to address the high computational cost and excessive supervision on low-value trajectory segments inherent in existing on-policy distillation methods, which typically rely on full-trajectory updates. The approach dynamically adjusts the training window length by continuously evaluating the consistency between the student policy’s prefix and the teacher policy during online execution. It further incorporates a delayed full-trajectory probing mechanism alongside a staleness control strategy to select efficient and accurate supervision signals. Empirical results on mathematical and code reasoning tasks demonstrate that the method reduces training costs by up to 4.1× while maintaining or even improving model accuracy compared to conventional approaches.

📝 Abstract

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix--full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy--compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

supervision horizon

rollout efficiency

adaptive training

teacher-student alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

adaptive window

horizon-aware