🤖 AI Summary
This paper addresses the limited generalization capability of knowledge distillation (KD). We propose Spaced KD, the first KD framework to formally incorporate the spacing effect from cognitive science: instead of continuous knowledge transfer, it periodically samples from the teacher model to establish a temporally aware distillation mechanism. Built upon SGD optimization, Spaced KD integrates dynamic teacher scheduling with spaced sampling, and we theoretically prove that it steers optimization toward flatter minima in the loss landscape, thereby enhancing robustness and generalization. On Tiny-ImageNet, Spaced KD improves accuracy by 2.31% over online distillation and by 3.34% over self-distillation; gains are consistent across both CNN and ViT architectures. Our core contribution is the principled formalization of spaced learning as an analyzable, scalable KD paradigm, uniquely bridging cognitive theory with rigorous optimization analysis and empirical effectiveness.
📝 Abstract
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the *spacing effect* in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
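The core mechanism described above — distilling from a teacher that is a snapshot of the model taken one space interval ahead of the student — can be sketched as follows. This is a hypothetical, heavily simplified toy illustration (not the authors' code): the "model" is a single scalar weight trained by SGD on a quadratic task loss, the teacher is a periodic snapshot of the student refreshed every `space` steps, and the distillation term simply pulls the student's output toward the snapshot's output. The names `spaced_kd_train`, `space`, and `alpha` are illustrative assumptions, not from the paper.

```python
def task_grad(w, x=1.0, y=2.0):
    """Gradient of the toy task loss 0.5 * (w*x - y)**2 w.r.t. w."""
    return (w * x - y) * x


def spaced_kd_train(steps=200, lr=0.1, space=10, alpha=0.5):
    """Toy Spaced-KD loop: SGD on the task loss plus a distillation
    pull toward a teacher snapshot refreshed every `space` steps."""
    w = 0.0          # student weight
    teacher_w = w    # teacher = spaced snapshot of the student
    for t in range(steps):
        if t % space == 0:
            teacher_w = w            # spaced sampling of the teacher
        kd_grad = (w - teacher_w)    # pull student toward teacher output
        w -= lr * (task_grad(w) + alpha * kd_grad)
    return w
```

In this sketch the distillation term vanishes whenever the teacher is refreshed and acts as a mild anchor to a slightly older state in between, so the student still converges to the task optimum (here `w = 2.0`); the interval `space` controls how stale the teacher is allowed to become, which is the knob the spacing-effect analogy concerns.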