🤖 AI Summary
This study investigates dataset size, a critical but long-overlooked variable in knowledge distillation (KD). To address the limited explanatory power of existing theories (e.g., the label smoothing and dark knowledge hypotheses) across data regimes, the authors design a systematic experimental framework spanning multiple tasks, architectures, and datasets, with strict control over sample count and model scale, and analyze the dynamics of the distillation loss. The results reveal KD's pronounced data efficiency: performance gains are substantially larger under data-scarce conditions. The label smoothing hypothesis is empirically refuted, whereas the dark knowledge hypothesis receives stronger support. Crucially, this work establishes dataset size as a fundamental determinant of KD's effectiveness and theoretical interpretability, challenging prevailing assumptions in the distillation literature. It provides rigorous evidence that data scale shapes both KD's empirical success and its underlying mechanism, thereby enabling principled theory refinement and supporting KD deployment in low-resource settings.
📝 Abstract
Knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size and generalisation. In this work, we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks, and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale, and relative number of samples on the observed phenomenon. Ultimately, this work reveals that dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
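For readers unfamiliar with the distillation objective the abstract refers to, the classic soft-label formulation (Hinton et al., 2015), on which most KD variants build, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's exact objective; the function names and the temperature value are assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution,
    # exposing the teacher's "dark knowledge" about non-target classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # KL divergence between the temperature-softened teacher and student
    # distributions, scaled by T^2 so gradients stay comparable across T.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this term is combined with the usual cross-entropy on the ground-truth labels; the label smoothing hypothesis the abstract mentions asks whether the soft teacher targets act merely as a smoothed label distribution.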