Efficient Knowledge Distillation via Curriculum Extraction

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the storage overhead and limited practicality of knowledge distillation methods that rely on retaining intermediate checkpoints of the teacher network. We propose an implicit curriculum distillation method that requires no storage of intermediate models. Our approach projects the teacher's hidden states onto random low-dimensional subspaces, constructing a naturally progressive sequence of representations that serves as an implicit curriculum guiding the student's gradual learning. Crucially, we provide theoretical guarantees that the final teacher model alone suffices to approximate the performance of full progressive distillation. Experiments show that our method significantly outperforms standard one-shot distillation on sparse parity learning, matching the accuracy of full progressive distillation, while also yielding consistent improvements on language modeling tasks. The method applies to both two-layer networks and Transformer architectures.

📝 Abstract
Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages (Hinton et al., 2015). While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work (Panigrahi et al., 2024) has shown that using intermediate checkpoints from the teacher's training process as an implicit "curriculum" for progressive distillation can significantly speed up training. However, such schemes require storing these checkpoints, and often require careful selection of the intermediate checkpoints to train on, which can be impractical for large-scale training. In this paper, we show that a curriculum can be extracted from just the fully trained teacher network, and that this extracted curriculum can give similar efficiency benefits to those of progressive distillation. Our extraction scheme is natural; we use a random projection of the hidden representations of the teacher network to progressively train the student network, before training using the output of the full network. We show that our scheme significantly outperforms one-shot distillation and achieves a performance similar to that of progressive distillation for learning sparse parities with two-layer networks, and provide theoretical guarantees for this setting. Additionally, we show that our method outperforms one-shot distillation even when using transformer-based architectures, both for sparse-parity learning and language modeling tasks.
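The core extraction step described above can be illustrated with a minimal sketch. This assumes the curriculum is built from Gaussian random projections of the trained teacher's hidden representations onto subspaces of increasing dimension; the function name, projection dimensions, and scaling choice are illustrative, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_curriculum(teacher_hidden, dims):
    """Build a coarse-to-fine sequence of distillation targets by
    projecting the teacher's hidden states onto random subspaces.

    teacher_hidden: (n_samples, d) hidden representations from the
        fully trained teacher (no intermediate checkpoints needed).
    dims: increasing projection dimensions, e.g. [4, 16, 64]; smaller
        dimensions give coarser, easier-to-match targets.
    """
    d = teacher_hidden.shape[1]
    stages = []
    for k in dims:
        # Gaussian projection matrix (d -> k), scaled 1/sqrt(k) so
        # inner products are approximately preserved.
        P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
        stages.append(teacher_hidden @ P)
    return stages
```

The student would then be trained to match each stage in order, finishing with standard distillation on the teacher's full output. Note that each stage reuses only the final teacher's hidden states, which is what removes the checkpoint-storage requirement.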
Problem

Research questions and friction points this paper is trying to address.

Extracting a curriculum from a fully trained teacher network
Improving the efficiency of knowledge distillation
Avoiding storage of intermediate teacher checkpoints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts a curriculum from the fully trained teacher network
Uses random projections of hidden representations for progressive training
Outperforms one-shot distillation