How Is Uncertainty Propagated in Knowledge Distillation?

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the uncertainty inherent in knowledge distillation, which arises from stochasticity in both teacher outputs and student training or inference. Conventional single-point estimation methods often neglect or distort this uncertainty. The paper presents the first explicit distinction and quantification of inter-student and intra-student uncertainty in distillation, reframing the process as one of uncertainty propagation and transformation. To this end, it introduces multi-teacher response sampling, variance-aware fusion, and inverse-variance weighting strategies. Formal guarantees are provided for linear models, while extensive experiments on neural networks and large language models demonstrate significant reductions in systematic noise and hallucinations, thereby enhancing student model stability and fidelity to teacher uncertainty.
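The inverse-variance weighting strategy mentioned in the summary can be sketched in a few lines. This is an illustrative reconstruction under standard assumptions (two unbiased, independent estimates), not the paper's code; the function and variable names are hypothetical:

```python
# Hypothetical sketch of inverse-variance weighting: fuse a teacher
# estimate and a student estimate into a minimum-variance combination.
# Assumes both estimates are unbiased and independent.
def inverse_variance_weight(mu_teacher, var_teacher, mu_student, var_student):
    """Return the inverse-variance-weighted mean and its variance."""
    w_teacher = 1.0 / var_teacher
    w_student = 1.0 / var_student
    mu = (w_teacher * mu_teacher + w_student * mu_student) / (w_teacher + w_student)
    var = 1.0 / (w_teacher + w_student)  # never exceeds either input variance
    return mu, var

# Example: the combined variance 1/(1/0.5 + 1/1.0) = 1/3 is below both inputs.
mu, var = inverse_variance_weight(2.0, 0.5, 3.0, 1.0)
```

Because the weights are the reciprocals of the variances, the noisier estimate is automatically down-weighted, which is what makes the combination a minimum-variance estimator among linear combinations of the two inputs.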

📝 Abstract
Knowledge distillation transfers behavior from a teacher to a student model, but the process is inherently stochastic: teacher outputs, student training, and student inference can all be random. Collapsing these uncertainties to a single point estimate can distort what is learned. We systematically study how uncertainty propagates through knowledge distillation across three representative model classes (linear regression, feed-forward neural networks, and large language models (LLMs)) and propose simple corrections. We distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student's predictive distribution), showing that standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. To address these mismatches, we introduce two variance-aware strategies: averaging multiple teacher responses, which reduces noise at rate $O(1/k)$, and variance-weighting, which combines teacher and student estimates via inverse-variance weighting to yield a minimum-variance estimator. We provide formal guarantees in linear regression, validate the methods in neural networks, and demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. These results reframe knowledge distillation as an uncertainty transformation and show that variance-aware distillation produces more stable students that better reflect teacher uncertainty.
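The abstract's first strategy, averaging multiple teacher responses to reduce noise at rate $O(1/k)$, can be demonstrated with a small simulation. This is a hedged sketch rather than the paper's experiment; the target value, noise level, and function names are all illustrative:

```python
import numpy as np

# Illustrative sketch, not the paper's implementation: averaging k
# independent noisy teacher responses shrinks the noise variance at
# rate O(1/k). The constants below are hypothetical.
rng = np.random.default_rng(0)
true_target = 1.0   # the clean teacher signal for one input
noise_std = 0.5     # stochasticity in each sampled teacher response

def averaged_teacher_response(k: int) -> float:
    """Mean of k independent noisy teacher outputs for a single input."""
    responses = true_target + noise_std * rng.normal(size=k)
    return responses.mean()

def empirical_variance(k: int, trials: int = 20000) -> float:
    """Variance of the k-averaged response across many simulated runs."""
    return float(np.var([averaged_teacher_response(k) for _ in range(trials)]))

# Variance shrinks roughly as noise_std**2 / k as k grows
variances = {k: empirical_variance(k) for k in (1, 4, 16)}
```

Each four-fold increase in the number of sampled teacher responses cuts the empirical variance by roughly a factor of four, matching the $O(1/k)$ rate the abstract states.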
Problem

Research questions and friction points this paper is trying to address.

uncertainty propagation
knowledge distillation
variance
stochasticity
model uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty propagation
knowledge distillation
variance-aware distillation
inter-student uncertainty
inverse-variance weighting