🤖 AI Summary
This work addresses a gap in knowledge distillation theory by analyzing the convergence behavior of student models trained via stochastic gradient descent (SGD) when the teacher provides Bayesian class probabilities or noisy estimates of them. Adopting a Bayesian perspective, the study establishes the first such convergence theory for SGD-trained students, demonstrating that exact Bayesian class probabilities reduce gradient variance and eliminate neighborhood terms in the optimization dynamics, and it quantifies how noise in the teacher's outputs affects generalization. Building on these theoretical insights, the authors propose employing Bayesian deep learning models as teachers to enhance distillation performance. Empirical results validate the theory, showing up to a 4.27% improvement in student accuracy and a 30% reduction in training noise during convergence.
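The variance-reduction claim above can be illustrated with a toy sketch (not the paper's actual setting): for a softmax classifier, the cross-entropy gradient with respect to the logits is `p_student - target`. When the target is a one-hot label sampled from the Bayes class probabilities, label sampling injects variance into the gradient; when the target is the exact Bayes probability vector, that noise component vanishes. All numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = np.array([0.6, 0.3, 0.1])    # assumed Bayes class probabilities at one input
logits = np.array([0.2, 0.1, -0.3])   # student logits at that input
p_s = np.exp(logits) / np.exp(logits).sum()

# One-hot supervision: the target is a random one-hot draw from p_star,
# so the gradient (p_s - target) fluctuates across draws.
onehot_grads = []
for _ in range(10_000):
    y = rng.choice(3, p=p_star)
    onehot_grads.append(p_s - np.eye(3)[y])
onehot_var = np.var(onehot_grads, axis=0).sum()

# BCP supervision: the target is the exact p_star, i.e. the expectation of
# the one-hot draws, so the label-noise component of the variance vanishes.
bcp_grad = p_s - p_star
bcp_var = 0.0  # deterministic target at this input
```

Both supervision signals share the same expected gradient; only the noise around it differs, which is the mechanism behind the removed neighborhood terms in the convergence bounds.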
📝 Abstract
Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ supervision with the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.
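The distillation setup described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the Bayesian teacher is stood in for by averaging softened softmax outputs over several posterior samples (e.g. MC-dropout or deep-ensemble forward passes, here just raw logit arrays), and the student is trained against the resulting probability vector with the standard temperature-softened KL objective. The function names and temperature value are assumptions for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax, numerically stabilized."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bayesian_teacher_probs(logit_samples, T=2.0):
    """Average softened softmax outputs over posterior samples.
    The average is typically a better estimate of the Bayes class
    probabilities than a single deterministic forward pass."""
    return softmax(np.asarray(logit_samples, dtype=float), T).mean(axis=0)

def kd_loss(student_logits, teacher_probs, T=2.0):
    """KL(teacher || student) on temperature-softened student outputs."""
    p_s = softmax(student_logits, T)
    q = np.asarray(teacher_probs, dtype=float)
    return float(np.sum(q * (np.log(q + 1e-12) - np.log(p_s + 1e-12))))

# Illustrative usage: three posterior forward passes of a 3-class teacher.
teacher_samples = [[2.0, 0.5, -1.0], [1.5, 0.8, -0.5], [2.2, 0.2, -1.2]]
q = bayesian_teacher_probs(teacher_samples)
loss = kd_loss([1.0, 0.0, -1.0], q)
```

Under the paper's analysis, the closer `q` is to the true BCPs, the lower the gradient variance the student sees during SGD, which is what motivates the Bayesian teacher.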