Teacher-Guided Policy Optimization for LLM Distillation

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard reverse KL divergence often yields uninformative negative feedback when the teacher and student distributions diverge significantly, which limits the efficiency of large language model distillation. This work proposes an on-policy distillation algorithm that combines reinforcement learning with imitation learning: the teacher's predictions, conditioned on trajectories generated by the student, provide dense and directionally informative guidance signals. The approach mitigates distributional shift without requiring additional annotations and keeps the on-policy update paradigm intact. Experiments on complex reasoning benchmarks show that the proposed framework significantly outperforms standard baselines and is robust to the choice of teacher model.

📝 Abstract

The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
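The limitation described in the abstract can be illustrated with a toy, single-position example. The sketch below is not TGPO's objective, which this page does not specify. It only assumes the usual single-sample reverse KL signal, log p_teacher(token) - log p_student(token), evaluated on a token sampled by the student. The four-token vocabulary, the logits, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """Exact reverse KL, KL(student || teacher), at one token position."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )

def sampled_token_signal(student_probs, teacher_probs, token):
    """Single-sample estimate of the negative reverse KL on a student-sampled token."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

# Hypothetical distributions at one position of a student rollout.
student = softmax([4.0, 0.0, 0.0, 0.0])  # student strongly prefers token 0
teacher = softmax([0.0, 0.0, 4.0, 0.0])  # teacher, given the same prefix, prefers token 2

sampled = 0  # the token the student actually generated

# The sampled-token signal is strongly negative: it says "token 0 was bad"
# but carries no information about which token would have been better.
signal = sampled_token_signal(student, teacher, sampled)

# The teacher's full prediction at the same position does carry a direction:
# it identifies the token the teacher would have preferred here.
teacher_choice = max(range(len(teacher)), key=teacher.__getitem__)

print(f"sampled-token signal: {signal:.3f}")
print(f"reverse KL at this position: {reverse_kl(student, teacher):.3f}")
print(f"teacher-preferred token: {teacher_choice}")
```

In this toy case the student receives a large penalty for token 0, while the teacher's distribution conditioned on the same prefix points to token 2. This is the kind of dense, directional information the abstract says TGPO draws from teacher predictions on the student's rollout; how TGPO actually incorporates it is defined in the paper, not here.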

Problem

Research questions and friction points this paper is trying to address.

LLM distillation

Reverse KL

on-policy learning

teacher-student divergence

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-Guided Policy Optimization

Reverse KL

LLM Distillation