Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limited generalization capability of knowledge distillation (KD). It proposes Spaced KD, the first KD framework to formally incorporate the spacing effect from cognitive science: instead of transferring knowledge continuously, the student periodically samples from the teacher model, yielding a temporally aware distillation mechanism. Under SGD optimization, a theoretical analysis shows that this spaced sampling steers optimization toward flatter minima in the loss landscape, thereby enhancing robustness and generalization. On Tiny-ImageNet, Spaced KD improves accuracy by up to 2.31% over online distillation and 3.34% over self-distillation, with consistent gains across both CNN and ViT architectures. The core contribution is a principled formalization of spaced learning as an analyzable, scalable KD paradigm that bridges cognitive theory with rigorous optimization analysis and empirical effectiveness.
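To make the spaced sampling mechanic concrete, below is a minimal self-KD-style sketch in PyTorch. It is an assumed setup, not the authors' implementation: here the teacher is a frozen snapshot of the student refreshed every space_interval steps (the paper instead trains the teacher a space interval ahead of the student), and names such as space_interval, alpha, and the temperature T are illustrative.

import copy
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Temperature-scaled KL divergence, the standard KD objective.
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, log_p_t, reduction="batchmean", log_target=True) * (T * T)

def train_spaced_kd(student, loader, space_interval=100, alpha=0.5):
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    teacher = None
    for step, (x, y) in enumerate(loader):
        # Spacing effect: refresh the frozen teacher only every
        # space_interval steps instead of distilling continuously.
        # (The first snapshot is taken at step 0, where the KD term is zero.)
        if step % space_interval == 0:
            teacher = copy.deepcopy(student).eval()
            for p in teacher.parameters():
                p.requires_grad_(False)
        logits = student(x)
        with torch.no_grad():
            teacher_logits = teacher(x)
        loss = F.cross_entropy(logits, y) + alpha * kd_loss(logits, teacher_logits)
        opt.zero_grad()
        loss.backward()
        opt.step()

The one hyperparameter that carries the spacing effect is space_interval: too small recovers ordinary continuous distillation, while too large leaves the teacher stale.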

📝 Abstract
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the "spacing effect" in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
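In standard notation, the per-step objective such a scheme optimizes can be written as follows; the weighting \alpha and temperature \tau are generic KD hyperparameters rather than values from the paper, with f_s the student and f_t the teacher trained a space interval ahead:

\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\big(f_s(x), y\big) + \alpha\,\tau^{2}\,\mathrm{KL}\Big(\sigma\big(f_t(x)/\tau\big)\,\big\|\,\sigma\big(f_s(x)/\tau\big)\Big),

where \sigma denotes the softmax function.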
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalization in knowledge distillation models
Implementing bio-inspired spacing effect in learning
Improving DNN performance via spaced knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spaced KD improves online KD
Spaced KD enhances self KD
Spacing effect boosts learning performance
Guanglong Sun
School of Life Sciences, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China; Beijing Academy of Artificial Intelligence, Beijing, China
Hongwei Yan
Tsinghua University
Brain-Inspired AI, Continual Learning, AI for Science
Liyuan Wang
Tsinghua University
bio-inspired learning, continual learning, neuroscience
Qian Li
School of Life Sciences, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China
Bo Lei
Beijing Academy of Artificial Intelligence
Neuroscience, Artificial intelligence
Yi Zhong
School of Life Sciences, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China