🤖 AI Summary
This work addresses a limitation of existing large language model distillation methods: they implicitly couple the source of the prefix context with the direction of the KL divergence, which constrains objective design and performance trade-offs. The authors decouple these two orthogonal axes of sequence-level KL divergence (prefix source and token-level KL direction), yielding four distillation objectives. At the gradient level, they connect these objectives to off-policy supervised fine-tuning (SFT), DAgger-style on-policy SFT, offline-RL-style distillation, and on-policy distillation (OPD). A controlled study on mathematical reasoning reveals three trade-offs: KL direction trades accuracy against entropy, prefix source trades quality against compute, and training length trades accuracy against stability. To navigate these, the authors introduce KL mixing and an entropy-gated length curriculum. KL mixing shows that long-sequence distillation needs substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k by 3.6 points and Pass@k by up to 5.8 points, and cuts average response length roughly 3x relative to fixed long-horizon training.
📝 Abstract
Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
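The two axes described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`token_kl`, `mixed_kl`), the `OBJECTIVES` table, and the convex-combination form of KL mixing are assumptions made for illustration; the abstract does not specify how the two directions are weighted. The pairing of each (prefix source, KL direction) cell with a named paradigm is one reading of the abstract's list.

```python
import numpy as np

# Assumed reading of the abstract: (prefix source, token-level KL direction)
# mapped to the paradigm each combination is connected to.
OBJECTIVES = {
    ("teacher", "forward"): "off-policy SFT",
    ("student", "forward"): "DAgger-style on-policy SFT",
    ("teacher", "reverse"): "offline-RL-style distillation",
    ("student", "reverse"): "OPD",
}


def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_kl(teacher_logits, student_logits, direction):
    """Per-position KL between next-token distributions.

    Both inputs have shape (sequence_length, vocab_size) and are assumed to be
    computed on the SAME prefix. Which model sampled that prefix is the
    separate "prefix source" axis and is decided by the caller.
    """
    log_p = log_softmax(teacher_logits)  # teacher distribution
    log_q = log_softmax(student_logits)  # student distribution
    if direction == "forward":   # KL(teacher || student)
        return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    if direction == "reverse":   # KL(student || teacher)
        return (np.exp(log_q) * (log_q - log_p)).sum(axis=-1)
    raise ValueError("direction must be 'forward' or 'reverse'")


def mixed_kl(teacher_logits, student_logits, forward_weight):
    # Hypothetical KL mixing: a convex combination of the two directions.
    fwd = token_kl(teacher_logits, student_logits, "forward")
    rev = token_kl(teacher_logits, student_logits, "reverse")
    return forward_weight * fwd + (1.0 - forward_weight) * rev
```

In this sketch, choosing an objective means two independent decisions: which model generates the prefix that both sets of logits are scored on, and which `direction` (or `forward_weight`) is applied per token. Averaging the returned per-position values over a response gives a scalar loss.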