AI Summary
In knowledge distillation, teacher representations often contain task-irrelevant information that can degrade student performance. To address this, we propose Redundant Information Distillation (RID), the first framework to incorporate Partial Information Decomposition (PID) into knowledge distillation from an information-theoretic perspective. RID replaces conventional representation alignment with explicit optimization of task-relevant redundancy, i.e., information about the labels that is shared by the teacher's and student's representations rather than unique to either. By quantifying mutual information across network layers and applying PID-guided regularization, RID identifies and transfers only the subset of teacher information that is discriminative for the downstream task. Experiments demonstrate that RID improves student robustness and generalization under noisy or suboptimal teachers, consistently outperforming state-of-the-art distillation methods across multiple benchmarks. Our work reveals fundamental limits on information transfer in distillation and establishes an interpretable, theoretically grounded optimization pathway.
Abstract
Knowledge distillation enables the deployment of complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain both the knowledge already transferred and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization that incorporates redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers, as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
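To make the central quantity concrete: PID decomposes the information two sources carry about a target into redundant, unique, and synergistic parts. The abstract does not specify which PID redundancy measure RID uses, so the sketch below uses the classic Williams–Beer I_min measure as one standard choice, computing the redundancy that a teacher feature T and a student feature S share about a label Y from a discrete joint distribution. The function name and the toy distribution are illustrative, not from the paper.

```python
import numpy as np

def williams_beer_redundancy(p):
    """Williams-Beer I_min redundancy R(Y; T, S) in bits.

    p: joint pmf array of shape (|Y|, |T|, |S|) over
       (label Y, teacher feature T, student feature S).
    """
    p = np.asarray(p, dtype=float)
    p_y = p.sum(axis=(1, 2))            # marginal p(y)
    sources = [p.sum(axis=2),           # joint p(y, t)
               p.sum(axis=1)]           # joint p(y, s)
    red = 0.0
    for y in range(p.shape[0]):
        if p_y[y] == 0:
            continue
        specs = []
        for p_yx in sources:
            p_x = p_yx.sum(axis=0)      # marginal of this source
            spec = 0.0                  # specific information I(Y=y; X)
            for x in range(p_yx.shape[1]):
                if p_yx[y, x] > 0:
                    p_x_given_y = p_yx[y, x] / p_y[y]
                    p_y_given_x = p_yx[y, x] / p_x[x]
                    spec += p_x_given_y * np.log2(p_y_given_x / p_y[y])
            specs.append(spec)
        # redundancy keeps only what EVERY source tells us about Y=y
        red += p_y[y] * min(specs)
    return red

# Toy example: teacher T is a perfect copy of Y, student S is pure noise.
# S carries no task information, so the redundancy is 0 bits: nothing
# task-relevant has been transferred yet, however well features "align".
p = np.zeros((2, 2, 2))
for y in range(2):
    for s in range(2):
        p[y, y, s] = 0.25               # T = Y, S independent of Y
print(williams_beer_redundancy(p))      # -> 0.0
```

If instead the student also copies the label (T = S = Y), the redundancy rises to the full 1 bit of task information, matching the intuition that redundancy tracks task-relevant knowledge shared by both networks rather than raw representational similarity.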