AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high-quality supervision and training-inference consistency in knowledge distillation for small language models (SLMs). The authors propose an adaptive token-level knowledge distillation method that introduces an online output-quality assessment mechanism coupled with a dynamic gating strategy, enabling real-time, token-wise switching between student-autonomous (on-policy) and teacher-guided (off-policy) generation. Unlike conventional distillation approaches that commit to a static supervision source, the method dynamically selects the better supervision signal per token while preserving training-inference consistency. Experiments across three diverse datasets and two teacher-student model configurations show that the approach significantly outperforms baseline methods in accuracy with modest computational overhead, and that it generalizes well across architectures and domains.
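
As a rough illustration of the mechanism described above, the sketch below implements a token-level switching decode loop in Python. The confidence-threshold gate, the function names, and the batch-size-1 setup are illustrative assumptions; the paper's actual quality-assessment and gating strategy may differ.

```python
import torch

@torch.no_grad()
def adaswitch_generate(student, teacher, input_ids, max_new_tokens=128,
                       quality_threshold=0.5):
    # input_ids: LongTensor of shape [1, prompt_len] (batch size 1 assumed).
    # `student` and `teacher` are HF-style causal LMs returning `.logits`.
    seq = input_ids
    for _ in range(max_new_tokens):
        student_logits = student(seq).logits[:, -1, :]    # [1, vocab]
        probs = torch.softmax(student_logits, dim=-1)
        confidence, student_token = probs.max(dim=-1)     # [1], [1]

        if confidence.item() >= quality_threshold:
            # Quality gate accepts the student's own token (on-policy step).
            next_token = student_token
        else:
            # Gate rejects it; fall back to the teacher's prediction
            # (off-policy step) for higher-quality supervision.
            teacher_logits = teacher(seq).logits[:, -1, :]
            next_token = teacher_logits.argmax(dim=-1)

        seq = torch.cat([seq, next_token.unsqueeze(-1)], dim=-1)
    return seq
```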

📝 Abstract
Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
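
Training then proceeds on the mixed trajectory, so the student is supervised on sequences it will plausibly produce at inference time. Below is a hedged sketch of one common token-level distillation objective, forward KL from teacher to student, applied to such a trajectory; the paper's exact loss is not specified here, so treat the objective and all names as assumptions.

```python
import torch
import torch.nn.functional as F

def mixed_trajectory_kd_loss(student, teacher, mixed_ids, prompt_len,
                             temperature=1.0):
    # mixed_ids: [batch, seq_len] token ids produced by adaptive switching.
    # Logits at position i predict token i + 1, so slice off the prompt.
    student_logits = student(mixed_ids).logits[:, prompt_len - 1:-1, :]
    with torch.no_grad():
        teacher_logits = teacher(mixed_ids).logits[:, prompt_len - 1:-1, :]

    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)

    # Per-token forward KL(teacher || student), summed over the vocabulary,
    # then averaged over batch and response positions.
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(dim=-1)
    return kl.mean() * temperature ** 2
```
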
Problem

Research questions and friction points this paper is trying to address.

How to achieve strong performance with small language models under strict latency and computational constraints
How to resolve the KD trade-off: off-policy supervision is high quality but creates a training-inference mismatch, while on-policy training is consistent but relies on low-quality student outputs
How to select the better supervision source at each token rather than committing to a single static source
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level switching that dynamically combines on-policy (student) and off-policy (teacher) generation; see the formula after this list
Selectively integrates teacher guidance based on real-time quality assessment of the student's own predictions
Preserves training-inference consistency while maintaining high supervision quality
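
Written compactly, the switching rule these bullets describe amounts to a per-token choice. Q and τ below are an assumed quality score and threshold, not the paper's exact criterion:

```latex
y_t =
\begin{cases}
  y_t^{\text{student}}, & Q\!\left(y_t^{\text{student}} \mid y_{<t}\right) \ge \tau \\[4pt]
  y_t^{\text{teacher}}, & \text{otherwise}
\end{cases}
```
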
👥 Authors

Jingyu Peng, University of Science and Technology of China
Maolin Wang, City University of Hong Kong
Hengyi Cai, Institute of Computing Technology, Chinese Academy of Sciences (Natural Language Processing)
Yuchen Li, Baidu Inc.
Kai Zhang, University of Science and Technology of China
Shuaiqiang Wang, Principal Architect of Search Strategy, Baidu Inc. (Large language models, Information retrieval)
Dawei Yin, Senior Director, Head of Search Science at Baidu (Machine Learning, Web Mining, Data Mining)
Xiangyu Zhao, City University of Hong Kong