$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Asynchronous training improves throughput, but in long-horizon tasks it trains on stale samples, so the learned policy deviates from the ideal on-policy objective and drifts in both its rollouts and its supervisory signals. This work is the first to decompose that on-policy discrepancy into these two components, rollout drift and supervision drift. It introduces f-OPD, a freshness-aware asynchronous on-policy distillation framework that reweights stale data using sample-level freshness scores, constraining policy drift while preserving high throughput. Experimental results demonstrate that f-OPD significantly outperforms existing asynchronous methods on long-interaction tasks such as reasoning, tool use, and code generation, achieving performance on par with synchronous training.

📝 Abstract

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.
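
To make the idea of a sample-level freshness score concrete, the following is a minimal illustrative sketch, not the paper's formulation. The abstract does not give the score's definition, so everything here is an assumption: the exponential decay form, the `decay` parameter, the version-lag bookkeeping, and the names `BufferedSample`, `freshness`, and `freshness_weighted_loss` are all hypothetical. The sketch only shows how a score that shrinks with rollout and supervision staleness could down-weight stale samples in a buffered batch.

```python
import math
from dataclasses import dataclass


@dataclass
class BufferedSample:
    """A buffered training sample (hypothetical structure, for illustration only)."""
    loss: float               # per-sample distillation loss, e.g. a KL term
    rollout_version: int      # student version that generated the rollout
    supervision_version: int  # student version at which the teacher signal was computed


def freshness(sample: BufferedSample, current_version: int, decay: float = 0.5) -> float:
    """Assumed freshness score in (0, 1]; 1.0 means the sample is fully on-policy.

    The exponential form is an assumption, not taken from the paper.
    """
    rollout_lag = current_version - sample.rollout_version          # proxy for rollout drift
    supervision_lag = current_version - sample.supervision_version  # proxy for supervision drift
    return math.exp(-decay * (rollout_lag + supervision_lag))


def freshness_weighted_loss(batch: list[BufferedSample], current_version: int,
                            decay: float = 0.5) -> float:
    """Average the per-sample losses, weighting each sample by its freshness."""
    weights = [freshness(s, current_version, decay) for s in batch]
    weighted = sum(w * s.loss for w, s in zip(weights, batch))
    return weighted / sum(weights)  # weights are strictly positive, so the sum is nonzero
```

Under these assumptions, a sample generated and supervised at the current policy version keeps full weight, while a sample that lags by several versions contributes much less to the batch loss. With `decay=0` the function reduces to a plain average, which corresponds to ordinary asynchronous training with no staleness control.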

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

asynchronous training

policy drift

sample staleness

long-horizon agentic tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

asynchronous training

freshness-aware control