🤖 AI Summary
Asynchronous training improves throughput, but in long-horizon tasks it trains on stale samples, so the learned policy deviates from the ideal on-policy objective. The staleness shows up as two kinds of drift: in the rollouts themselves and in the supervisory signals attached to them. This work decomposes the on-policy bias into these two drift components and introduces f-OPD, a freshness-aware framework for asynchronous on-policy distillation (OPD) that reweights stale data using sample-level freshness scores. The approach constrains policy drift while preserving high throughput. According to the reported experiments, f-OPD significantly outperforms existing asynchronous methods on long-interaction tasks such as reasoning, tool use, and code generation, and reaches performance on par with synchronous training.
📝 Abstract
Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but it structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, which capture staleness in the student rollout and in the teacher context, respectively. Building on this decomposition, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates the influence of stale samples and constrains the policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a favorable performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.
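The abstract does not give the form of the freshness score, so the following is only an illustrative sketch of the general idea of down-weighting stale buffered samples. Everything in it is an assumption made for illustration: the `BufferedSample` fields, the two lag counters standing in for rollout drift and supervision drift, the exponential decay, and the decay rates are all hypothetical and are not taken from f-OPD.

```python
import math
from dataclasses import dataclass


@dataclass
class BufferedSample:
    # Per-sample distillation loss (e.g. a KL term), computed elsewhere.
    loss: float
    # Hypothetical proxy for rollout drift: student updates since the rollout was generated.
    rollout_lag: int
    # Hypothetical proxy for supervision drift: student updates since the teacher context was built.
    supervision_lag: int


def freshness(sample, rollout_decay=0.1, supervision_decay=0.1):
    """Assumed freshness score in (0, 1]; 1.0 means the sample has no lag."""
    penalty = (rollout_decay * sample.rollout_lag
               + supervision_decay * sample.supervision_lag)
    return math.exp(-penalty)


def freshness_weighted_loss(batch, rollout_decay=0.1, supervision_decay=0.1):
    """Average the per-sample losses, weighting each sample by its freshness."""
    if not batch:
        raise ValueError("batch must contain at least one sample")
    weights = [freshness(s, rollout_decay, supervision_decay) for s in batch]
    weighted_sum = sum(w * s.loss for w, s in zip(weights, batch))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    batch = [
        BufferedSample(loss=1.0, rollout_lag=0, supervision_lag=0),    # fresh
        BufferedSample(loss=3.0, rollout_lag=10, supervision_lag=10),  # stale
    ]
    print(freshness_weighted_loss(batch))
```

With this weighting, the stale sample contributes less than it would under a plain mean, so the batch loss is pulled toward the fresh sample. A hard cutoff that discards samples beyond some lag would be a simpler alternative, at the cost of throwing away data that a soft weight can still use.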