🤖 AI Summary
This work addresses the lack of a unified modeling framework for feature-level distribution matching in knowledge distillation. We propose KD²M, the first formal, general-purpose framework for distribution-matching-based knowledge distillation. By systematically unifying distribution metrics, including the Wasserstein distance, Maximum Mean Discrepancy (MMD), and Kullback–Leibler (KL) divergence, we establish a novel theoretical analysis paradigm and design a fair, cross-dataset, cross-task evaluation benchmark. Theoretically, we derive the first generalization error bound grounded in distribution matching. Empirically, we validate the effectiveness and complementarity of multiple metrics on CIFAR and ImageNet. KD²M provides an interpretable, scalable, and reproducible toolkit for modeling and evaluating feature-level knowledge transfer, advancing the field from heuristic design toward theory-driven development.
📝 Abstract
Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher network to a student neural network. This is often done by matching the networks' predictions (i.e., their outputs), but several recent works have instead proposed matching the distributions of the networks' activations (i.e., their features), a process known as *distribution matching*. In this paper, we propose a unifying framework, Knowledge Distillation through Distribution Matching (KD²M), which formalizes this strategy. Our contributions are threefold: we i) provide an overview of distribution metrics used in distribution matching, ii) benchmark these metrics on computer vision datasets, and iii) derive new theoretical results for KD.
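To make the feature-level matching idea concrete, here is a minimal NumPy sketch of one distribution metric the abstract mentions, a (biased) squared-MMD estimator with a Gaussian kernel, applied to batches of teacher and student activations. The function names, kernel choice, and bandwidth are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(f_teacher, f_student, sigma=1.0):
    # Biased estimator of squared Maximum Mean Discrepancy between
    # the empirical feature distributions of teacher and student.
    k_tt = gaussian_kernel(f_teacher, f_teacher, sigma)
    k_ss = gaussian_kernel(f_student, f_student, sigma)
    k_ts = gaussian_kernel(f_teacher, f_student, sigma)
    return k_tt.mean() + k_ss.mean() - 2.0 * k_ts.mean()

# Toy activations: a student whose features track the teacher's
# yields a smaller MMD than a mismatched one.
rng = np.random.default_rng(0)
t = rng.normal(0.0, 1.0, size=(64, 16))          # teacher features
s_near = t + 0.01 * rng.normal(size=(64, 16))    # well-matched student
s_far = rng.normal(3.0, 1.0, size=(64, 16))      # mismatched student
print(mmd2(t, s_near) < mmd2(t, s_far))          # True
```

In a distillation loop, such a term would be added to the task loss so that gradients pull the student's activation distribution toward the teacher's; the Wasserstein and KL variants surveyed in the paper plug into the same slot with different estimators.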