Dual-Forward Path Teacher Knowledge Distillation: Bridging the Capacity Gap Between Teacher and Student

📅 2025-06-22
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In knowledge distillation (KD), a large capacity gap between teacher and student models impedes efficient knowledge transfer, and existing methods struggle to simultaneously preserve knowledge fidelity and support dynamic adaptation. To address this, we propose Dual-Forward Path Teacher Knowledge Distillation (DFPT-KD), a KD framework built on prompt-based tuning: with the pre-trained teacher backbone frozen, a learnable auxiliary prompt-based forward path is introduced to dynamically modulate the teacher's outputs, aligning them with the student's representational capacity for capacity-adaptive distillation. An extended variant, DFPT-KD+, further fine-tunes the whole prompt-based forward path to make the transferred knowledge even more compatible with the student. Extensive experiments show that both variants outperform vanilla KD across diverse student architectures, with DFPT-KD+ achieving state-of-the-art accuracy.

📝 Abstract
Knowledge distillation (KD) provides an effective way to improve the performance of a student network under the guidance of pre-trained teachers. However, there is usually a large capacity gap between the teacher and student networks, which limits the distillation gains. Previous methods addressing this problem either discard accurate knowledge representations or fail to dynamically adjust the transferred knowledge, making them less effective at closing the capacity gap and preventing students from reaching performance comparable to the pre-trained teacher. In this work, we extend the idea of prompt-based learning to the capacity gap problem and propose Dual-Forward Path Teacher Knowledge Distillation (DFPT-KD), which replaces the pre-trained teacher with a novel dual-forward path teacher to supervise the learning of the student. The key to DFPT-KD is prompt-based tuning: an additional prompt-based forward path is established within the pre-trained teacher and optimized while the pre-trained teacher remains frozen, so that the transferred knowledge is compatible with the representation ability of the student. Extensive experiments demonstrate that students trained with DFPT-KD outperform those trained with vanilla KD. To make the transferred knowledge even more compatible with the student's representation ability, we further fine-tune the whole prompt-based forward path, yielding a distillation approach dubbed DFPT-KD+. Extensive experiments show that DFPT-KD+ improves upon DFPT-KD and achieves state-of-the-art accuracy.
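
Based only on the description above, the following is a minimal sketch of how such a dual-forward path teacher could be structured, assuming a staged CNN teacher that exposes a list of `stages` and a classification `head`; the 1x1-convolution prompt modules are illustrative placeholders, not the paper's actual prompt design.

```python
import torch.nn as nn


class DualForwardTeacher(nn.Module):
    """Minimal sketch of a dual-forward path teacher (not the authors' code)."""

    def __init__(self, stages: nn.ModuleList, head: nn.Module, feat_dims):
        super().__init__()
        self.stages, self.head = stages, head
        # Freeze the pre-trained teacher: backbone stages and classification head.
        for p in list(self.stages.parameters()) + list(self.head.parameters()):
            p.requires_grad_(False)
        # Lightweight learnable prompt modules, one per stage, remain trainable.
        self.prompts = nn.ModuleList(nn.Conv2d(d, d, kernel_size=1) for d in feat_dims)

    def forward(self, x):
        h_orig, h_prompt = x, x
        for stage, prompt in zip(self.stages, self.prompts):
            h_orig = stage(h_orig)                    # original forward path (frozen)
            h_prompt = stage(h_prompt)
            h_prompt = h_prompt + prompt(h_prompt)    # prompt-based forward path
        # `head` is assumed to map final features to logits (e.g. pooling + linear).
        return self.head(h_orig), self.head(h_prompt)
```

In this sketch only the prompt modules carry gradients, matching the DFPT-KD setting in which the pre-trained teacher stays frozen while the prompt-based path is tuned.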
Problem

Research questions and friction points this paper is trying to address.

Addresses capacity gap in teacher-student knowledge distillation
Enhances knowledge transfer compatibility with student representation
Improves student performance beyond traditional distillation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-forward path teacher that replaces the pre-trained teacher during distillation
Prompt-based tuning that adapts the transferred knowledge while the pre-trained teacher stays frozen
Fine-tuning of the whole prompt-based forward path (DFPT-KD+) for better compatibility with the student (see the training sketch below)
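
To make these points concrete, here is a hedged sketch of one distillation step built on the DualForwardTeacher sketched earlier; the loss composition, temperature, and weighting are assumptions for illustration, not the paper's reported settings.

```python
import torch.nn.functional as F


def dfpt_kd_step(teacher, student, optimizer, x, y, T=4.0, alpha=0.9):
    # Supervise the student with the prompt-based forward path of the teacher.
    _, prompt_logits = teacher(x)
    student_logits = student(x)

    # Hard-label loss on the student.
    ce = F.cross_entropy(student_logits, y)

    # Soft-label distillation from the prompt-based path (vanilla-KD form).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(prompt_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Assumed auxiliary term keeping the prompt path accurate on the labels.
    prompt_ce = F.cross_entropy(prompt_logits, y)

    loss = alpha * kd + (1.0 - alpha) * ce + prompt_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

Which parameters are handed to the optimizer distinguishes the two variants: the student parameters plus the prompt modules for a DFPT-KD-style run, and additionally the teacher's whole prompt-based forward path for a DFPT-KD+-style run.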
👥 Authors
Tong Li
School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Long Liu
Professor, School of Biotechnology, Jiangnan University
Yihang Hu
IIIS, Tsinghua University
Hu Chen
School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Shifeng Chen
School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China