🤖 AI Summary
Standard on-policy distillation (OPD) suffers from high-variance updates, vanishing gradients in zero-advantage regions, and insufficient exploration. This work proposes Asymmetric On-Policy Distillation (AOPD), which introduces, for the first time, a token-level asymmetric feedback mechanism: it preserves policy gradient updates in positive-advantage regions and replaces ineffective negative reinforcement in non-positive-advantage regions with minimization of a local KL divergence. By integrating advantage-weighted policy gradients, local divergence constraints, and token-level teacher signals, AOPD mitigates the structural limitations of standard OPD. Experiments show that AOPD outperforms OPD on mathematical reasoning benchmarks, with average performance gains of 4.09 and 8.34 under strong and weak initialization settings, respectively, while maintaining higher policy entropy and improving capability retention in sequential tool-use tasks.
📝 Abstract
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive-advantage regions while preserving positive reinforcement in positive-advantage regions. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
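The asymmetric rule described in the abstract can be illustrated with a minimal sketch of a per-token loss. This is not the authors' implementation, and several details are assumptions the abstract does not state: the advantage is taken to be the teacher log-probability minus the student log-probability of the sampled token, the local divergence is taken to be KL(student || teacher) over the vocabulary at that position, and both branches are weighted equally. The function names (`aopd_token_loss`, `aopd_sequence_loss`) are hypothetical. The code only evaluates loss values; in training, the advantage would be treated as a constant and gradients would flow through the student log-probabilities.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def aopd_token_loss(student_logits, teacher_logits, token_id):
    """Return (branch, loss) for one sampled token.

    Assumption: advantage = log pi_teacher(token) - log pi_student(token).
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    advantage = log_t[token_id] - log_s[token_id]
    if advantage > 0:
        # Positive-advantage region: keep the policy-gradient surrogate.
        return "pg", -advantage * log_s[token_id]
    # Non-positive region: replace negative reinforcement with a local KL.
    # Assumption: the direction is KL(student || teacher).
    return "kl", kl_divergence(log_s, log_t)


def aopd_sequence_loss(student_logits_seq, teacher_logits_seq, token_ids):
    """Mean of the per-token losses over one sampled trajectory."""
    losses = [
        aopd_token_loss(s, t, y)[1]
        for s, t, y in zip(student_logits_seq, teacher_logits_seq, token_ids)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    student = [2.0, 0.0, 0.0]
    teacher = [0.0, 2.0, 0.0]
    # Token 1 is under-weighted by the student, so the advantage is positive.
    print(aopd_token_loss(student, teacher, 1))
    # Token 0 is over-weighted by the student, so the local KL branch is used.
    print(aopd_token_loss(student, teacher, 0))
```

Under these assumptions, a token the teacher prefers more strongly than the student receives the usual advantage-weighted update, while every other token contributes a full-distribution divergence term. That term stays non-zero whenever the two distributions differ, even if the sampled token's advantage is exactly zero, which is one plausible reading of how the method avoids vanishing gradients in zero-advantage regions.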