AI Summary
In knowledge distillation, transferring the teacher model's raw logit distribution can misguide the student when the teacher assigns low confidence to the correct class. To address this, we propose Swapped Logit Distillation (SLD), a paradigm that builds a dual-teacher collaborative framework by swapping logits in both teacher and student outputs, thereby decoupling output modeling from probability calibration. We further introduce a phased loss scheduling mechanism to mitigate the risk of misguidance from overconfident (i.e., maximum) logits. Unlike conventional single-teacher approaches, which transfer the teacher's hard or soft labels directly, SLD drops this restrictive assumption. Extensive experiments on ResNet and ViT backbones show consistent improvements across multiple image classification benchmarks, outperforming state-of-the-art distillation methods. Notably, SLD yields substantial gains in both accuracy and generalization for low-capacity student models.
Abstract
Knowledge distillation (KD) compresses a network by transferring knowledge from a large (teacher) network to a smaller (student) one. In the mainstream approach, the teacher transfers knowledge to the student directly using its original distribution, which can lead to incorrect predictions. In this article, we propose a logit-based distillation method built on swapped logit processing, namely Swapped Logit Distillation (SLD). SLD rests on two assumptions: (1) a wrong prediction occurs when the confidence of the target label is not the maximum; (2) the "natural" limit of probability remains uncertain, as the best value to add to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. With this approach, we find that the swap can be effectively extended to both teacher and student outputs, which turns them into two teachers. We further introduce loss scheduling to improve the alignment of the two teachers. Extensive experiments on image classification tasks demonstrate that SLD consistently outperforms previous state-of-the-art methods.
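The abstract does not spell out the swap procedure, so the following is only a minimal sketch of one plausible reading of assumption (1): when the class with the largest logit is not the target class, the two logit values trade places so that the target ends up holding the maximum. The function name `swap_logits` and the example values are hypothetical and are not taken from the paper.

```python
def swap_logits(logits, target):
    """Return a copy of `logits` in which the target class holds the maximum.

    Assumed reading of "swapped logit processing" (not the authors' code):
    if the arg-max already equals `target`, nothing changes; otherwise the
    maximum logit and the target logit are exchanged.
    """
    swapped = list(logits)  # copy, so the caller's list is left untouched
    top = max(range(len(swapped)), key=swapped.__getitem__)  # index of the max
    if top != target:
        swapped[top], swapped[target] = swapped[target], swapped[top]
    return swapped


teacher_logits = [2.0, 5.0, 1.0]  # arg-max is class 1
label = 0                         # ground-truth class is 0
print(swap_logits(teacher_logits, label))  # [5.0, 2.0, 1.0]
```

Under this reading the swap only reorders existing values, so no extra amount has to be chosen and added to the target logit, which is the difficulty that assumption (2) points to. The same function could be applied to student logits to obtain the second "teacher" signal mentioned in the abstract, though how the two signals are weighted and scheduled is not specified here.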