AI Summary
In knowledge distillation, teacher representations often contain task-irrelevant information that can degrade student performance. To address this, we propose Redundant Information Distillation (RID), the first framework to incorporate Partial Information Decomposition (PID) into knowledge distillation from an information-theoretic perspective. RID replaces conventional representation alignment with explicit optimization of task-relevant redundancy, i.e., information about the labels that is shared by the teacher's and student's representations rather than unique to either. By quantifying mutual information across network layers and applying PID-guided regularization, RID identifies and transfers only the subset of teacher information that is discriminative for the downstream task. Experiments demonstrate that RID improves student robustness and generalization under noisy or suboptimal teachers, consistently outperforming state-of-the-art distillation methods across multiple benchmarks. Our work reveals fundamental limits on information transfer in distillation and establishes an interpretable, theoretically grounded optimization pathway.
Abstract
Knowledge distillation enables the deployment of complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain both the knowledge already transferred and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization that incorporates redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers, as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
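To make the central quantity concrete: PID decomposes the information two sources carry about a target into redundant, unique, and synergistic parts. The abstract does not specify which PID redundancy measure RID uses, so the sketch below uses the classic Williams–Beer I_min measure as one standard choice, computing the redundancy that a teacher feature T and a student feature S share about a label Y from a discrete joint distribution. The function name and the toy distribution are illustrative, not from the paper.

```python
import numpy as np

def williams_beer_redundancy(p):
    """Williams-Beer I_min redundancy R(Y; T, S) in bits.

    p: joint pmf array of shape (|Y|, |T|, |S|) over
       (label Y, teacher feature T, student feature S).
    """
    p = np.asarray(p, dtype=float)
    p_y = p.sum(axis=(1, 2))            # marginal p(y)
    sources = [p.sum(axis=2),           # joint p(y, t)
               p.sum(axis=1)]           # joint p(y, s)
    red = 0.0
    for y in range(p.shape[0]):
        if p_y[y] == 0:
            continue
        specs = []
        for p_yx in sources:
            p_x = p_yx.sum(axis=0)      # marginal of this source
            spec = 0.0                  # specific information I(Y=y; X)
            for x in range(p_yx.shape[1]):
                if p_yx[y, x] > 0:
                    p_x_given_y = p_yx[y, x] / p_y[y]
                    p_y_given_x = p_yx[y, x] / p_x[x]
                    spec += p_x_given_y * np.log2(p_y_given_x / p_y[y])
            specs.append(spec)
        # redundancy keeps only what EVERY source tells us about Y=y
        red += p_y[y] * min(specs)
    return red

# Toy example: teacher T is a perfect copy of Y, student S is pure noise.
# S carries no task information, so the redundancy is 0 bits: nothing
# task-relevant has been transferred yet, however well features "align".
p = np.zeros((2, 2, 2))
for y in range(2):
    for s in range(2):
        p[y, y, s] = 0.25               # T = Y, S independent of Y
print(williams_beer_redundancy(p))      # -> 0.0
```

If instead the student also copies the label (T = S = Y), the redundancy rises to the full 1 bit of task information, matching the intuition that redundancy tracks task-relevant knowledge shared by both networks rather than raw representational similarity.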