🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high-quality supervision and training-inference consistency in knowledge distillation for small language models (SLMs). The authors propose AdaSwitch, an adaptive token-level knowledge distillation method that introduces an online output quality assessment mechanism coupled with a dynamic gating strategy, enabling real-time, token-wise switching between student-generated (on-policy) and teacher-guided (off-policy) supervision. Unlike conventional distillation approaches that rely on a static supervision source, the method selects the supervision signal per token while preserving training-inference consistency. Experiments across three diverse datasets and two teacher-student model configurations show that AdaSwitch consistently outperforms baseline methods in accuracy with modest computational overhead, and generalizes across architectures and domains.
📝 Abstract
Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
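The token-level gating described above can be illustrated with a minimal sketch: the student first proposes a token, an online quality check decides whether to keep it (on-policy) or substitute the teacher's token (off-policy). All function names, the confidence-style score, and the threshold below are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)

# Toy stand-ins for the student/teacher models and the quality assessor.
# These are hypothetical placeholders, not the paper's components.
def student_propose(context):
    # Student first explores its own prediction (on-policy step);
    # returns a token and a confidence-like score.
    return "s_tok", random.random()

def teacher_token(context):
    # Teacher's prediction for the same position (off-policy signal).
    return "t_tok"

def assess_quality(token, score, threshold):
    # Online quality assessment: here, a simple confidence threshold.
    return score >= threshold

def adaswitch_generate(prompt, max_tokens=8, threshold=0.5):
    """Token-level gating: keep the student's token when it passes the
    quality check, otherwise switch to the teacher's token."""
    context, trace = list(prompt), []
    for _ in range(max_tokens):
        tok, score = student_propose(context)
        if assess_quality(tok, score, threshold):
            chosen, source = tok, "student"          # on-policy: consistent with inference
        else:
            chosen, source = teacher_token(context), "teacher"  # off-policy: higher-quality supervision
        context.append(chosen)
        trace.append(source)
    return trace

print(adaswitch_generate(["<bos>"]))
```

The resulting mixed trajectory serves as the distillation target, so each position is supervised by whichever source the gate judged better at that token.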