🤖 AI Summary
This paper addresses a central challenge in knowledge distillation (KD): student models struggle to accurately inherit the critical information flow paths of their teacher models. To this end, the authors propose InDistill, a distillation warm-up framework tailored to model compression. Its core contributions are threefold: (1) it makes *information flow path preservation* the primary distillation objective; (2) it introduces a *difficulty-aware curriculum learning scheme* that progressively guides the student to emulate the teacher's essential forward-pass pathways; and (3) it proposes a *channel pruning strategy* that requires no auxiliary encoder and directly aligns the teacher's intermediate layer widths with the student's. Extensive experiments on CIFAR-10/100 and ImageNet demonstrate that InDistill consistently improves baseline KD approaches on both image classification and retrieval tasks. The implementation is publicly available.
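The curriculum idea can be sketched as a warm-up schedule that distills one intermediate layer at a time, shallowest first, so early information flow paths are established before deeper ones are matched. This is a minimal illustrative sketch; the function name, the per-layer epoch budget, and the fixed shallow-to-deep ordering are assumptions, not the paper's exact procedure.

```python
def indistill_warmup_schedule(num_layers, epochs_per_layer):
    """Yield (epoch, layer_index) pairs for the warm-up stage.

    One intermediate layer is distilled at a time, shallowest first,
    reflecting the curriculum's progression from easier (early) to
    harder (deeper) layers during the critical learning periods.
    """
    epoch = 0
    for layer in range(num_layers):          # shallow -> deep
        for _ in range(epochs_per_layer):    # fixed budget per layer (assumed)
            yield epoch, layer
            epoch += 1
```

After this warm-up, standard KD on the output logits would proceed as usual; the schedule above only fixes the order in which intermediate layers are matched.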
📝 Abstract
In this paper, we introduce InDistill, a method that serves as a warm-up stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved via a training scheme based on curriculum learning that considers the distillation difficulty of each layer and the critical learning periods during which the information flow paths are established. This procedure can lead to a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation when there is a discrepancy between the widths of the teacher and student layers. This pruning operation reduces the widths of the teacher's intermediate layers to match those of the student, allowing direct distillation without the need for an encoding stage. The proposed method is extensively evaluated using various pairs of teacher-student architectures on the CIFAR-10, CIFAR-100, and ImageNet datasets, demonstrating that preserving the information flow paths consistently increases the performance of the baseline KD approaches in both classification and retrieval settings. The code is available at https://github.com/gsarridis/InDistill.
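The width-alignment step can be sketched as selecting a subset of teacher channels so the pruned teacher layer matches the student layer's width, removing the need for a learned encoder. This is a hedged sketch: the function name and the mean-absolute-activation saliency criterion are assumptions for illustration, not the paper's exact pruning rule.

```python
def prune_teacher_channels(teacher_feat, student_width):
    """Reduce a teacher layer's width to match the student's.

    teacher_feat: list of per-channel activation vectors
                  (num_teacher_channels x num_positions).
    Keeps the `student_width` channels with the highest mean absolute
    activation (an assumed saliency score), preserving their original
    order, so teacher and student features can be compared directly.
    """
    saliency = [sum(abs(v) for v in ch) / len(ch) for ch in teacher_feat]
    ranked = sorted(range(len(teacher_feat)), key=lambda c: -saliency[c])
    keep = sorted(ranked[:student_width])  # keep original channel order
    return [teacher_feat[c] for c in keep]
```

Because the pruned teacher features already have the student's width, an intermediate-layer distillation loss can be applied to them directly, with no auxiliary encoding stage in between.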