🤖 AI Summary
This paper addresses the limited generalization capability of knowledge distillation (KD). We propose Spaced KD, the first KD framework to formally incorporate the spacing effect from cognitive science: instead of continuous knowledge transfer, it periodically samples from the teacher model to establish a temporally aware distillation mechanism. Built upon SGD optimization, Spaced KD integrates dynamic teacher scheduling with spaced sampling, and we theoretically prove that it steers optimization toward flatter minima in the loss landscape, thereby enhancing robustness and generalization. On Tiny-ImageNet, Spaced KD improves accuracy by 2.31% over online distillation and by 3.34% over self-distillation; gains are consistent across both CNN and ViT architectures. Our core contribution is the principled formalization of spaced learning as an analyzable, scalable KD paradigm, uniquely bridging cognitive theory with rigorous optimization analysis and empirical effectiveness.
📝 Abstract
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the *spacing effect* in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
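The core mechanism described above — distilling from a teacher that is a snapshot of the model taken one space interval ahead of the student — can be sketched as follows. This is a hypothetical, heavily simplified toy illustration (not the authors' code): the "model" is a single scalar weight trained by SGD on a quadratic task loss, the teacher is a periodic snapshot of the student refreshed every `space` steps, and the distillation term simply pulls the student's output toward the snapshot's output. The names `spaced_kd_train`, `space`, and `alpha` are illustrative assumptions, not from the paper.

```python
def task_grad(w, x=1.0, y=2.0):
    """Gradient of the toy task loss 0.5 * (w*x - y)**2 w.r.t. w."""
    return (w * x - y) * x


def spaced_kd_train(steps=200, lr=0.1, space=10, alpha=0.5):
    """Toy Spaced-KD loop: SGD on the task loss plus a distillation
    pull toward a teacher snapshot refreshed every `space` steps."""
    w = 0.0          # student weight
    teacher_w = w    # teacher = spaced snapshot of the student
    for t in range(steps):
        if t % space == 0:
            teacher_w = w            # spaced sampling of the teacher
        kd_grad = (w - teacher_w)    # pull student toward teacher output
        w -= lr * (task_grad(w) + alpha * kd_grad)
    return w
```

In this sketch the distillation term vanishes whenever the teacher is refreshed and acts as a mild anchor to a slightly older state in between, so the student still converges to the task optimum (here `w = 2.0`); the interval `space` controls how stale the teacher is allowed to become, which is the knob the spacing-effect analogy concerns.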