🤖 AI Summary
This work addresses the inefficient utilization of “dark knowledge” in knowledge distillation due to teacher–student model capacity mismatch. We identify two empirical regularities in large-capacity teacher outputs: (i) low discriminability among non-ground-truth class probabilities, yet (ii) stable inter-class relative affinity relationships. Building on this, we establish the first quantitative link between teacher capacity and dark knowledge structure, proposing a novel paradigm that enhances the discriminability of non-ground-truth logits to mitigate capacity mismatch—moving beyond conventional reliance solely on teacher accuracy. Methodologically, we integrate logit softening with temperature calibration, an inter-class discrepancy enhancement module, and a multi-teacher contrastive distillation framework. Experiments on CIFAR-100 and ImageNet demonstrate significant performance gains for lightweight student networks, consistently outperforming state-of-the-art methods including FitNet and RKD. The approach proves robust across diverse teacher–student capacity configurations.
📝 Abstract
Knowledge Distillation (KD) can transfer the "dark knowledge" of a well-performing yet large neural network to a weaker but lightweight one. From the view of output logits and softened probabilities, this paper goes deeper into the dark knowledge provided by teachers with different capacities. Two fundamental observations are: (1) a larger teacher tends to produce probability vectors that are less distinct among the non-ground-truth classes; (2) teachers with different capacities are largely consistent in their perception of relative class affinity. Extensive experimental studies verify these observations, and in-depth empirical explanations are provided. This difference in dark knowledge leads to the peculiar phenomenon named "capacity mismatch": a more accurate teacher does not necessarily teach a given student network better than a smaller teacher does. Enlarging the distinctness between non-ground-truth class probabilities of larger teachers can address the capacity mismatch problem. This paper explores multiple simple yet effective ways to achieve this goal and verifies their success by comparing them with popular KD methods that tackle capacity mismatch.
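The two observations above can be illustrated with a toy sketch: temperature-scaled softmax softens the logits, and a simple variance over the non-ground-truth probabilities serves as a (hypothetical) distinctness measure — the paper's exact metric may differ, and the example logits below are invented for illustration.

```python
import numpy as np

def soften(logits, T=4.0):
    # Temperature-scaled softmax: a larger T flattens the distribution,
    # exposing more of the "dark knowledge" in the non-ground-truth classes.
    z = logits / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def non_gt_distinctness(probs, gt_index):
    # Variance among non-ground-truth class probabilities
    # (an assumed proxy for how "distinct" the dark knowledge is).
    non_gt = np.delete(probs, gt_index)
    return non_gt.var()

# "Large teacher"-style output: very confident, with nearly uniform
# logits over the wrong classes (toy values, not from the paper).
large_teacher = np.array([9.0, 1.0, 0.9, 1.1, 1.0])
# "Small teacher"-style output: wrong-class logits are more spread out.
small_teacher = np.array([6.0, 3.0, 0.5, 2.0, 1.0])

p_large = soften(large_teacher, T=4.0)
p_small = soften(small_teacher, T=4.0)

# Observation (1): the larger teacher's non-GT probabilities are less distinct.
print(non_gt_distinctness(p_large, 0) < non_gt_distinctness(p_small, 0))  # True

# Observation (2): both teachers agree on relative class affinity,
# i.e., they rank the non-GT classes in the same order.
print(np.argsort(p_large[1:]).tolist() == np.argsort(p_small[1:]).tolist())
```

In this toy setting, enlarging the gaps among the non-ground-truth probabilities of the larger teacher (e.g., by rescaling its non-GT logits) would be one way to mitigate the mismatch, in the spirit of the methods the paper explores.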