Exploring Dark Knowledge under Various Teacher Capacities and Addressing Capacity Mismatch

📅 2024-05-21
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficient utilization of "dark knowledge" in knowledge distillation caused by teacher–student capacity mismatch. It identifies two empirical regularities in large-capacity teacher outputs: (i) low discriminability among non-ground-truth class probabilities, yet (ii) stable inter-class relative affinity relationships. Building on these, the authors establish a quantitative link between teacher capacity and dark-knowledge structure, proposing to enhance the discriminability of non-ground-truth logits to mitigate capacity mismatch, moving beyond conventional reliance solely on teacher accuracy. Methodologically, the paper combines logit softening with temperature calibration, an inter-class discrepancy enhancement module, and a multi-teacher contrastive distillation framework. Experiments on CIFAR-100 and ImageNet demonstrate significant gains for lightweight student networks, consistently outperforming methods such as FitNet and RKD, and the approach proves robust across diverse teacher–student capacity configurations.
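Observation (i), that a larger teacher's non-ground-truth probabilities are less distinct after temperature softening, can be illustrated with a minimal sketch. The logit values and the temperature `T=4.0` below are hypothetical illustrations, not numbers from the paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened probabilities; larger T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Hypothetical 4-class logits; index 0 is the ground-truth class.
small_teacher = [6.0, 3.0, 1.5, 0.5]
large_teacher = [9.0, 1.0, 0.8, 0.7]    # non-ground-truth logits nearly flat

for name, logits in [("small", small_teacher), ("large", large_teacher)]:
    p = softmax(logits, T=4.0)
    # Spread among non-ground-truth probabilities is what the student learns from.
    print(name, "non-GT variance:", round(variance(p[1:]), 5))
```

With these toy logits, the smaller teacher's non-ground-truth probabilities show a visibly larger variance than the larger teacher's, mirroring the paper's first observation.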

📝 Abstract
Knowledge Distillation (KD) can transfer the "dark knowledge" of a well-performing yet large neural network to a weaker but lightweight one. From the view of output logits and softened probabilities, this paper goes deeper into the dark knowledge provided by teachers with different capacities. Two fundamental observations are: (1) a larger teacher tends to produce probability vectors that are less distinct between non-ground-truth classes; (2) teachers with different capacities are basically consistent in their cognition of relative class affinity. Abundant experimental studies verify these observations, and in-depth empirical explanations are provided. The difference in dark knowledge leads to the peculiar phenomenon named "capacity mismatch": a more accurate teacher does not necessarily perform as well as a smaller teacher when teaching the same student network. Enlarging the distinctness between non-ground-truth class probabilities for larger teachers could address the capacity mismatch problem. This paper explores multiple simple yet effective ways to achieve this goal and verifies their success by comparing them with popular KD methods that solve capacity mismatch.
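The proposed remedy, enlarging the distinctness between non-ground-truth class probabilities, can be sketched under assumptions: the `enlarge_non_gt` helper and the stretch factor `alpha` below are illustrative inventions, not the authors' actual methods, which the abstract only describes at a high level:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def enlarge_non_gt(logits, gt_idx, alpha=2.0):
    """Stretch non-ground-truth logits away from their mean by factor alpha,
    leaving the ground-truth logit untouched (illustrative transform only)."""
    non_gt = [z for i, z in enumerate(logits) if i != gt_idx]
    mu = sum(non_gt) / len(non_gt)
    return [z if i == gt_idx else mu + alpha * (z - mu)
            for i, z in enumerate(logits)]

teacher = [9.0, 1.0, 0.8, 0.7]          # hypothetical large-teacher logits
p_before = softmax(teacher, T=4.0)
p_after = softmax(enlarge_non_gt(teacher, 0), T=4.0)
print("non-GT variance before:", round(variance(p_before[1:]), 6))
print("non-GT variance after: ", round(variance(p_after[1:]), 6))
```

Because the relative ordering of the non-ground-truth logits is preserved (observation (2) in the abstract), such a transform makes the teacher's secondary preferences more visible to the student without changing which classes it considers more similar.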
Problem

Research questions and friction points this paper is trying to address.

Investigates dark knowledge transfer in teachers of varying capacities
Examines impact of teacher size on class probability distinctness
Proposes solutions to address capacity mismatch in distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing dark knowledge via output logits
Exploring teacher capacity impact on distillation
Addressing capacity mismatch with simple methods
Xin-Chun Li
School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210023, China
Wen-Shu Fan
School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210023, China
Bowen Tao
School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210023, China
Le Gan
Nanjing University of Science and Technology
Artificial Intelligence · Machine Learning
De-Chuan Zhan
Nanjing University, China
Machine Learning · Data Mining