🤖 AI Summary
Knowledge distillation is widely used to improve model generalization, yet its theoretical underpinnings remain poorly understood. This work models teacher and student training as coupled stochastic processes and introduces a “distillation divergence”, the Kullback-Leibler divergence between the two processes' stochastic kernels, to quantify the discrepancy between teacher and student. Within this information-theoretic framework, the authors derive upper and lower bounds on the student’s generalization error, stated relative to the teacher’s generalization gap, that depend explicitly on the distillation divergence. A loss-sharpness-aware bound further shows that the teacher’s local flatness can strictly tighten the bound within an explicit tightness regime. In a linear Gaussian case study, the distillation divergence itself decomposes into interpretable bias, variance, and rank-bottleneck costs, which yields practical guidance for designing distillation algorithms.
📝 Abstract
Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between the stochastic kernels of these two processes. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
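To give a feel for why a KL-based divergence can split into interpretable pieces in a Gaussian setting, the sketch below computes the standard closed-form KL divergence between two diagonal Gaussians and separates it into a mean-mismatch term and a covariance-mismatch term. This is only an illustration under stated assumptions, not the paper's decomposition: the function name `diag_gaussian_kl`, the diagonal-covariance restriction, and the direction of the divergence (student relative to teacher) are all hypothetical choices made here, and the rank-bottleneck cost mentioned in the abstract is not modeled.

```python
import math


def diag_gaussian_kl(mu_s, var_s, mu_t, var_t):
    """KL( N(mu_s, diag(var_s)) || N(mu_t, diag(var_t)) ), split in two.

    Hypothetical helper for illustration only. Returns (mean_term, cov_term);
    their sum is the full KL divergence between the two diagonal Gaussians.
    All variances must be strictly positive.
    """
    mean_term = 0.0  # penalizes a shift between the means (bias-like)
    cov_term = 0.0   # penalizes a mismatch in spread (variance-like)
    for ms, vs, mt, vt in zip(mu_s, var_s, mu_t, var_t):
        mean_term += 0.5 * (ms - mt) ** 2 / vt
        ratio = vs / vt
        cov_term += 0.5 * (ratio - 1.0 - math.log(ratio))
    return mean_term, cov_term


# Assumed toy numbers: the "student" is shifted in the first coordinate
# and has a larger variance than the "teacher" in the second coordinate.
mean_part, cov_part = diag_gaussian_kl(
    mu_s=[0.5, 0.0], var_s=[1.0, 2.0],
    mu_t=[0.0, 0.0], var_t=[1.0, 1.0],
)
print("mean-mismatch term:", mean_part)
print("covariance-mismatch term:", cov_part)
print("total KL:", mean_part + cov_part)
```

Both terms are non-negative and vanish exactly when the two Gaussians coincide, so each one can be read as a separate cost. How the paper maps its bias, variance, and rank-bottleneck costs onto the teacher and student kernels is not specified in the abstract and is left open here.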