🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high-quality supervision and training-inference consistency in knowledge distillation for small language models (SLMs). The authors propose AdaSwitch, an adaptive token-level knowledge distillation method that introduces an online output quality assessment mechanism coupled with a dynamic gating strategy, enabling real-time, token-wise switching between student-generated (on-policy) and teacher-guided (off-policy) supervision. Unlike conventional distillation approaches that rely on a static supervision source, the method selects the supervision signal per token while preserving training-inference consistency. Experiments across three diverse datasets and two teacher-student model configurations show that AdaSwitch consistently outperforms baseline methods in accuracy with modest computational overhead, and generalizes across architectures and domains.
📝 Abstract
Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
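The token-level gating described above can be illustrated with a minimal sketch: the student first proposes a token, an online quality check decides whether to keep it (on-policy) or substitute the teacher's token (off-policy). All function names, the confidence-style score, and the threshold below are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)

# Toy stand-ins for the student/teacher models and the quality assessor.
# These are hypothetical placeholders, not the paper's components.
def student_propose(context):
    # Student first explores its own prediction (on-policy step);
    # returns a token and a confidence-like score.
    return "s_tok", random.random()

def teacher_token(context):
    # Teacher's prediction for the same position (off-policy signal).
    return "t_tok"

def assess_quality(token, score, threshold):
    # Online quality assessment: here, a simple confidence threshold.
    return score >= threshold

def adaswitch_generate(prompt, max_tokens=8, threshold=0.5):
    """Token-level gating: keep the student's token when it passes the
    quality check, otherwise switch to the teacher's token."""
    context, trace = list(prompt), []
    for _ in range(max_tokens):
        tok, score = student_propose(context)
        if assess_quality(tok, score, threshold):
            chosen, source = tok, "student"          # on-policy: consistent with inference
        else:
            chosen, source = teacher_token(context), "teacher"  # off-policy: higher-quality supervision
        context.append(chosen)
        trace.append(source)
    return trace

print(adaswitch_generate(["<bos>"]))
```

The resulting mixed trajectory serves as the distillation target, so each position is supervised by whichever source the gate judged better at that token.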