AI Summary
Standard reverse KL divergence often yields uninformative negative feedback when the teacher and student distributions diverge significantly, which limits the efficiency of large language model distillation. This work proposes Teacher-Guided Policy Optimization (TGPO), an on-policy distillation algorithm that combines reinforcement learning with imitation learning: the teacher's predictions, conditioned on trajectories generated by the student, provide dense and directionally informative guidance. The approach mitigates distributional shift without requiring additional annotations while preserving the on-policy update paradigm. Experiments show that it significantly outperforms standard baselines on complex reasoning benchmarks and is robust to the choice of teacher model.
Abstract
The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
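To make the stated limitation concrete, here is a minimal sketch of the standard reverse KL signal on a single token position. It is not TGPO's objective, which the abstract does not specify; the function names and the toy three-token vocabulary are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over the vocabulary."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))


def sampled_rkl_reward(student_logits, teacher_logits, token_id):
    """Single-sample RKL signal: teacher/student log-ratio on the token
    the student actually sampled. A negative value penalizes that token."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return log_p_t[token_id] - log_p_s[token_id]


# Toy case (hypothetical logits): the student prefers token 0,
# while the teacher, conditioned on the same prefix, prefers token 1.
student = [2.0, 0.0, 0.0]
teacher = [0.0, 2.0, 0.0]

reward = sampled_rkl_reward(student, teacher, token_id=0)
teacher_choice = max(range(len(teacher)), key=lambda i: teacher[i])
```

In this toy case the reward on the sampled token is approximately -2, which tells the student that token 0 is unlikely under the teacher but not which token to move toward. The teacher's own prediction on the same student prefix (`teacher_choice`, token 1 here) carries that directional information, which is the kind of signal the abstract says TGPO exploits.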