🤖 AI Summary
This work addresses the instability of trajectory-level KL divergence in on-policy distillation (OPD) for multi-turn agent tasks, where errors compound across turns and degrade performance. The authors propose TCOD, a temporal curriculum distillation framework that identifies this instability and mitigates it with a curriculum schedule: the trajectory depth exposed to the student model is expanded progressively from short to long horizons. Across three multi-turn agent benchmarks (ALFWorld, WebShop, and ScienceWorld), TCOD reduces KL escalation and improves agent performance by up to 18 points over vanilla OPD. TCOD can also surpass the teacher's performance and generalizes to tasks on which the teacher fails.
📝 Abstract
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that the KL divergence rises as the success rate drops, and that the KL remains high even after convergence, which makes training unstable. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
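To make the idea of a depth curriculum concrete, the following is a minimal sketch of how the number of turns exposed to the student could grow during training. The abstract does not specify the form of the schedule, so the linear ramp, the function names `curriculum_depth` and `truncate_trajectory`, and the parameters `min_depth` and `max_depth` are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence, TypeVar

Turn = TypeVar("Turn")


def curriculum_depth(step: int, total_steps: int,
                     min_depth: int, max_depth: int) -> int:
    """Return the maximum number of turns exposed to the student at `step`.

    Assumption: a linear ramp from `min_depth` to `max_depth`. The source
    only states that depth expands from short to long under a schedule.
    """
    if total_steps <= 1:
        return max_depth
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_depth + int(round(frac * (max_depth - min_depth)))


def truncate_trajectory(turns: Sequence[Turn], depth: int) -> List[Turn]:
    """Keep only the first `depth` turns of a rollout (hypothetical helper)."""
    return list(turns[:depth])


if __name__ == "__main__":
    rollout = [f"turn_{i}" for i in range(30)]
    for step in (0, 50, 100):
        depth = curriculum_depth(step, total_steps=101, min_depth=2, max_depth=30)
        kept = truncate_trajectory(rollout, depth)
        print(step, depth, len(kept))
```

In a training loop of this shape, the distillation loss would be computed only on the turns returned by `truncate_trajectory`, so that early updates see short horizons where the student is more likely to remain within the teacher's effective support.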