Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitation of conventional knowledge distillation caused by mismatched feature distributions between teacher and student models. To mitigate this issue, the authors propose DSKD, a novel approach that integrates a lightweight diffusion model to perform denoising sampling on student features under the guidance of the teacher’s classifier. A self-distillation mechanism is then established between the original and denoised student features to enhance representation learning. Furthermore, locality-sensitive hashing (LSH) is employed to enable efficient feature alignment, effectively alleviating mapping discrepancies. Extensive experiments demonstrate that DSKD consistently outperforms existing distillation methods across multiple vision recognition tasks and model architectures, achieving superior performance and strong generalization capability.

πŸ“ Abstract
Existing Knowledge Distillation (KD) methods often align feature information between teacher and student by exploring meaningful feature processing and loss functions. However, due to the difference in feature distributions between the teacher and student, the student model may learn incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed DSKD. Instead of direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process of denoising student features through a lightweight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and can be regarded as playing a teacher role. In this way, DSKD eliminates discrepancies in mapping manners and feature distributions between the teacher and student, while learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is attached in the supplementary material.
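The LSH-guided alignment idea can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it uses random-hyperplane LSH (sign of projections onto random directions) to hash feature batches into binary codes, and measures bit agreement between original student features and a lightly perturbed stand-in for the diffusion-denoised features. All names (`lsh_codes`, `lsh_agreement`) and the hashing scheme are hypothetical choices for illustration.

```python
import numpy as np

def lsh_codes(features, planes):
    # Random-hyperplane LSH: each bit is the sign of the projection
    # of a feature vector onto one random hyperplane normal.
    return (features @ planes.T > 0).astype(np.int8)

def lsh_agreement(a, b, planes):
    # Fraction of hash bits on which two feature batches agree --
    # a toy stand-in for an LSH-based alignment signal.
    return float((lsh_codes(a, planes) == lsh_codes(b, planes)).mean())

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 64))      # 16 hash bits over 64-dim features
student = rng.standard_normal((8, 64))      # "original" student features
# Lightly perturbed copy standing in for the diffusion-denoised features
denoised = student + 0.05 * rng.standard_normal((8, 64))
unrelated = rng.standard_normal((8, 64))    # baseline: independent features

print(lsh_agreement(student, denoised, planes))   # high agreement
print(lsh_agreement(student, unrelated, planes))  # near chance level
```

Because hash bits only flip for features lying near a hyperplane, small perturbations preserve most codes, so similar feature batches score near 1.0 while unrelated ones hover around 0.5. An actual distillation loss would compare hash distributions differentiably rather than via hard sign bits.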
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Feature Distribution Mismatch
Teacher-Student Alignment
Incompatible Knowledge Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Model
Knowledge Distillation
Self-Distillation
Locality-Sensitive Hashing
Feature Denoising
Yu Wang
Shanghai Jiao Tong University & Shanghai AI Laboratory
Natural Language Processing, Speech and Language Processing, Large Language Model

Chuanguang Yang
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Knowledge Distillation, Representation Learning

Zhulin An
Institute of Computing Technology, Chinese Academy of Sciences
Automatic Deep Learning, Lifelong Learning

Weilun Feng
Institute of Computing Technology, Chinese Academy of Sciences
Model Compression, Machine Learning

Jiarui Zhao
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Chengqing Yu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

Libo Huang
Institute of Computing Technology, Chinese Academy of Sciences
Continual Learning, Neural Data Analysis

Boyu Diao
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

Yongjun Xu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences