Teacher-Guided Policy Optimization for LLM Distillation

📅 2026-05-13
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Standard reverse KL divergence often yields uninformative negative feedback when the teacher and student distributions diverge significantly, which limits the efficiency of large language model distillation. This work proposes Teacher-Guided Policy Optimization (TGPO), an on-policy distillation algorithm that combines reinforcement learning with imitation learning: the teacher's predictions, conditioned on trajectories generated by the student, provide dense and directionally informative guidance. The approach mitigates distribution shift without requiring additional annotations while preserving the on-policy update paradigm. Experiments show that the framework significantly outperforms standard baselines on complex reasoning benchmarks and is robust to the choice of teacher model.
πŸ“ Abstract
The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
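To make the baseline concrete, here is a minimal sketch of the standard reverse KL signal on a student rollout, written with NumPy. It is not the TGPO objective, which this summary does not specify. The logits are random placeholders standing in for model outputs, and the helper names (`reverse_kl_per_token`, `sampled_token_reward`) are hypothetical. The teacher logits are assumed to come from scoring the student's own rollout, as the abstract describes.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    # KL(student || teacher) at each rollout position; result has shape (T,).
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return (np.exp(log_p_s) * (log_p_s - log_p_t)).sum(axis=-1)

def sampled_token_reward(student_logits, teacher_logits, tokens):
    # Single-sample view of the same signal:
    # log p_teacher(y_t) - log p_student(y_t) for the tokens the student sampled.
    idx = np.arange(len(tokens))
    log_p_s = log_softmax(student_logits)[idx, tokens]
    log_p_t = log_softmax(teacher_logits)[idx, tokens]
    return log_p_t - log_p_s

# Placeholder data (assumption): T rollout positions, vocabulary of size V.
rng = np.random.default_rng(0)
T, V = 4, 8
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))  # teacher scored on the student's rollout

# Sample the rollout tokens from the student itself (on-policy).
student_probs = np.exp(log_softmax(student_logits))
tokens = np.array([rng.choice(V, p=student_probs[t]) for t in range(T)])

rkl = reverse_kl_per_token(student_logits, teacher_logits)
reward = sampled_token_reward(student_logits, teacher_logits, tokens)
```

A strongly negative entry in `reward` says only that the teacher finds the sampled token unlikely; it does not say which token the teacher would have preferred. That is one way to read the "uninformative negative feedback" the abstract attributes to standard RKL when the two distributions are far apart.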
Problem

Research questions and friction points this paper is trying to address.

LLM distillation
Reverse KL
on-policy learning
teacher-student divergence
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-Guided Policy Optimization
Reverse KL
LLM Distillation
On-policy Learning
Dense Directional Guidance