Swapped Logit Distillation via Bi-level Teacher Alignment

📅 2025-04-27

🤖 AI Summary
In knowledge distillation, transferring the teacher model's raw logit distribution can mislead the student when the teacher assigns low confidence to the target class. To address this, we propose Swapped Logit Distillation (SLD), a paradigm that swaps logits and extends the swap to both teacher and student outputs, building a dual-teacher collaborative framework that decouples output modeling from probability calibration. We further introduce a phased loss scheduling mechanism to mitigate the risk of misguidance from overconfident (i.e., maximum) logits. Unlike conventional single-teacher approaches, SLD does not assume that the teacher's original distribution should be transferred directly as hard or soft labels. Extensive experiments on ResNet and ViT backbones show consistent improvements across multiple image classification benchmarks, outperforming state-of-the-art distillation methods. Notably, SLD yields gains in both accuracy and generalization for low-capacity student models.

📝 Abstract
Knowledge distillation (KD) compresses a network by transferring knowledge from a large (teacher) network to a smaller (student) one. In the mainstream approach, the teacher transfers knowledge to the student directly through its original output distribution, which can lead to incorrect predictions. In this article, we propose a logit-based distillation method built on swapped logit processing, namely Swapped Logit Distillation (SLD). SLD rests on two assumptions: (1) a wrong prediction occurs when the confidence of the target label is not the maximum; (2) the "natural" limit of probability remains uncertain, as the best value to add to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. We find that the swap method can be effectively extended to both teacher and student outputs, which yields two teachers. We further introduce loss scheduling to improve the alignment of the two teachers. Extensive experiments on image classification tasks demonstrate that SLD consistently outperforms previous state-of-the-art methods.
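
The abstract does not spell out the swap operation, so the following is only a minimal sketch of one plausible reading: when the target class does not hold the largest logit, the target logit and the maximum logit trade places before the softmax. The function names (`swap_logits`, `swapped_kd_loss`), the temperature value, and the conventional temperature-squared scaling of the KL term are assumptions for illustration, not details confirmed by the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def swap_logits(logits, target):
    """Return a copy in which the target logit and the maximum logit trade places.

    Assumed reading of "swapped logit processing"; the input is left unchanged.
    """
    swapped = list(logits)
    top = max(range(len(logits)), key=lambda i: logits[i])
    if top != target:
        swapped[top], swapped[target] = swapped[target], swapped[top]
    return swapped


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def swapped_kd_loss(teacher_logits, student_logits, target, temperature=4.0):
    """KL between the swapped teacher distribution and the student distribution.

    The temperature**2 factor follows common KD practice (an assumption here).
    """
    p_teacher = softmax(swap_logits(teacher_logits, target), temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)


# Hypothetical example: the teacher ranks class 1 highest, but the label is class 0.
teacher = [2.0, 5.0, 1.0]
student = [1.5, 2.5, 0.5]
print(swap_logits(teacher, target=0))  # -> [5.0, 2.0, 1.0]
print(swapped_kd_loss(teacher, student, target=0))
```

Under this reading, the swap leaves the set of logit values untouched and only changes which class holds the maximum, so the teacher's distribution keeps its overall shape while no longer pointing the student at a wrong class.
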
Problem

Research questions and friction points this paper is trying to address.

Addresses incorrect predictions in knowledge distillation
Proposes swapped logit processing for teacher-student alignment
Enhances performance via loss scheduling and dual teachers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swapped logit processing for distillation
Bi-level teacher alignment method
Loss scheduling enhances performance
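
The summary mentions loss scheduling for the two teachers without giving the schedule itself. As a purely illustrative sketch, one simple form is to switch on the second (student-derived) swap loss only after a warm-up period. The function names, the step-shaped gate, and the `start_epoch` value below are all hypothetical and are not taken from the paper.

```python
def scheduled_weight(epoch, start_epoch=30):
    """Hypothetical step schedule: 0.0 before start_epoch, 1.0 from then on."""
    return 1.0 if epoch >= start_epoch else 0.0


def total_loss(teacher_swap_loss, student_swap_loss, epoch, start_epoch=30):
    """Combine the two swap losses, gating the student-derived term by epoch."""
    gate = scheduled_weight(epoch, start_epoch)
    return teacher_swap_loss + gate * student_swap_loss


# Early in training only the teacher term contributes; later both do.
print(total_loss(0.8, 0.4, epoch=5))
print(total_loss(0.8, 0.4, epoch=40))
```

The intuition behind delaying the second term is that early student outputs are unreliable, so using them as a teaching signal from the first epoch could reinforce mistakes. Whether the paper uses a hard gate, a ramp, or some other schedule is not stated in this summary.
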