SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in existing knowledge distillation theory by analyzing the convergence behavior of student models trained via stochastic gradient descent (SGD), particularly when the teacher provides Bayesian class probabilities or their noisy estimates. From a Bayesian perspective, this study establishes the first convergence theory for SGD-trained student models, demonstrating that exact Bayesian class probabilities reduce gradient variance and eliminate neighborhood terms in the optimization dynamics. Furthermore, it quantifies how noise in teacher outputs affects generalization. Building on these theoretical insights, the authors propose employing Bayesian deep learning models as teachers to enhance distillation performance. Empirical results validate the theory, showing up to a 4.27% improvement in student accuracy and a 30% reduction in training noise during convergence.
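The "neighborhood terms" mentioned above can be illustrated with the standard constant-step-size SGD bound for a $\mu$-strongly-convex, smooth objective; this is a generic textbook bound shown only to clarify the terminology, not the paper's own result:

```latex
\mathbb{E}\big[\|\theta_t - \theta^\star\|^2\big]
  \;\le\; (1 - \eta\mu)^t \,\|\theta_0 - \theta^\star\|^2
  \;+\; \frac{\eta\,\sigma^2}{\mu}
```

Here $\sigma^2$ bounds the variance of the stochastic gradients. The first term decays geometrically, while the second is the "neighborhood" around the optimum that constant-step-size SGD converges to; supervision that lowers $\sigma^2$, such as exact Bayes class probabilities, shrinks this neighborhood, and in the idealized noiseless case removes it.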

📝 Abstract
Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ when the teacher provides the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.
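The mechanics described in the abstract can be sketched in a few lines: the student's cross-entropy gradient against soft teacher targets is `softmax(z) - p`, and a "Bayesian teacher" can be emulated by averaging class probabilities over several stochastic forward passes (e.g. MC-dropout samples). This is a minimal illustration under assumed names (`kd_gradient`, `bayesian_teacher_probs` are hypothetical helpers, not the paper's code):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_gradient(student_logits, teacher_probs):
    # Gradient of cross-entropy w.r.t. logits with soft targets:
    # grad = softmax(z) - p. With a one-hot p this reduces to the
    # usual hard-label gradient; soft targets lower its variance.
    return softmax(student_logits) - teacher_probs

def bayesian_teacher_probs(sample_logits):
    # Hypothetical Bayesian teacher: average the class probabilities
    # of several stochastic forward passes to estimate the BCPs.
    return np.mean([softmax(z) for z in sample_logits], axis=0)

rng = np.random.default_rng(0)
z_student = rng.normal(size=5)
# Eight noisy teacher samples, all favoring class 0.
teacher_samples = rng.normal(loc=[2, 0, 0, 0, 0], size=(8, 5))
p_bar = bayesian_teacher_probs(teacher_samples)

lr = 0.1
z_student -= lr * kd_gradient(z_student, p_bar)  # one SGD step on the KD loss
```

Note that because both `softmax(z)` and `p_bar` sum to one, the gradient components sum to zero, so the update only redistributes mass between logits.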
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Bayesian Teachers
Stochastic Gradient Descent
Convergence Analysis
Bayes Class Probabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Knowledge Distillation
Stochastic Gradient Descent
Bayes Class Probabilities
Convergence Analysis
Variance Reduction
Itai Morad
School of ECE, Ben-Gurion University, Be'er Sheva, Israel
Nir Shlezinger
Ben-Gurion University of the Negev
Signal processing, machine learning, communications, information theory
Y. Eldar
Faculty of Math and CS, Weizmann Institute of Science, Rehovot, Israel