Swapped Logit Distillation via Bi-level Teacher Alignment

📅 2025-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In knowledge distillation, transferring the teacher model's raw logit distribution can misguide the student whenever the ground-truth class does not receive the highest confidence. To address this, the paper proposes Swapped Logit Distillation (SLD), which swaps logits so that the target class holds the maximum value. Applying the same swap to both teacher and student outputs yields a dual-teacher framework. A loss scheduling mechanism is further introduced to improve the alignment of the two teachers and to reduce the risk of misguidance from an incorrect maximum logit. Unlike conventional single-teacher approaches, SLD does not assume that the teacher's original distribution should be transferred directly. Extensive experiments on ResNet and ViT backbones demonstrate consistent improvements across multiple image classification benchmarks, outperforming state-of-the-art distillation methods. Notably, SLD yields substantial gains in both accuracy and generalization for low-capacity student models.

📝 Abstract

Knowledge distillation (KD) compresses a network by transferring knowledge from a large (teacher) network to a smaller (student) one. In the mainstream approach, the teacher transfers knowledge to the student directly using its original distribution, which can lead to incorrect predictions. In this article, we propose a logit-based distillation method via swapped logit processing, namely Swapped Logit Distillation (SLD). SLD is proposed under two assumptions: (1) a wrong prediction occurs when the confidence of the prediction label is not the maximum; (2) the "natural" limit of probability remains uncertain, as the best value to add to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. Through this approach, we find that the swap method can be effectively extended to both teacher and student outputs, which turns them into two teachers. We further introduce loss scheduling to boost the performance of the two teachers' alignment. Extensive experiments on image classification tasks demonstrate that SLD consistently outperforms previous state-of-the-art methods.
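
As a rough illustration of the swap idea in the abstract, the sketch below exchanges the maximum logit with the target-class logit whenever the two differ, so the ground-truth class ends up with the highest confidence while all other values are untouched. This is a minimal, plain-Python sketch based only on the description above; the function name `swap_logits` and the example values are hypothetical, and the paper's actual implementation may differ in detail.

```python
from typing import List


def swap_logits(logits: List[float], target: int) -> List[float]:
    """Return a copy of `logits` in which the target class holds the maximum.

    Assumption: "swapping" means exchanging the value at the arg-max index
    with the value at the ground-truth index when they are not the same.
    """
    swapped = list(logits)  # copy so the caller's list is not modified
    top = max(range(len(swapped)), key=swapped.__getitem__)
    if top != target:
        swapped[top], swapped[target] = swapped[target], swapped[top]
    return swapped


# Hypothetical teacher output that wrongly prefers class 1 over target 0.
teacher_logits = [2.0, 5.0, 1.0]
print(swap_logits(teacher_logits, target=0))  # [5.0, 2.0, 1.0]
```

Because the swap only permutes two existing values, the multiset of logits is unchanged, which sidesteps the question raised in assumption (2) of how much to add to the target.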

Problem

Research questions and friction points this paper is trying to address.

Addresses incorrect predictions in knowledge distillation

Proposes swapped logit processing for teacher-student alignment

Enhances performance via loss scheduling and dual teachers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Swapped logit processing for distillation

Bi-level teacher alignment method

Loss scheduling enhances performance
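
One way the two-teacher alignment and loss scheduling could fit together is sketched below. It is only an illustration under stated assumptions: the step-shaped schedule, the `start_epoch` and `temperature` values, the equal weighting of the two terms, and every function name are hypothetical and are not taken from the paper. The swapped teacher output and the swapped student output each act as a target distribution, and the student-derived term is switched on only after a warm-up period.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def swap_logits(logits: List[float], target: int) -> List[float]:
    """Exchange the arg-max value with the target value (assumed swap rule)."""
    swapped = list(logits)
    top = max(range(len(swapped)), key=swapped.__getitem__)
    if top != target:
        swapped[top], swapped[target] = swapped[target], swapped[top]
    return swapped


def scheduled_weight(epoch: int, start_epoch: int) -> float:
    """Hypothetical step schedule: 0 during warm-up, 1 afterwards."""
    return 1.0 if epoch >= start_epoch else 0.0


def two_teacher_loss(student_logits: List[float], teacher_logits: List[float],
                     target: int, epoch: int, start_epoch: int = 30,
                     temperature: float = 4.0) -> float:
    """Distillation loss against the swapped teacher and swapped student."""
    p_student = softmax(student_logits, temperature)
    p_teacher_swap = softmax(swap_logits(teacher_logits, target), temperature)
    # In a real training loop this pseudo-teacher would be treated as a
    # constant (no gradient); here it is just a list of floats.
    p_student_swap = softmax(swap_logits(student_logits, target), temperature)
    teacher_term = kl_divergence(p_teacher_swap, p_student)
    student_term = kl_divergence(p_student_swap, p_student)
    return teacher_term + scheduled_weight(epoch, start_epoch) * student_term


# Hypothetical example: both networks initially miss target class 1.
early = two_teacher_loss([3.0, 1.0, 0.5], [4.0, 2.0, 0.1], target=1, epoch=0)
late = two_teacher_loss([3.0, 1.0, 0.5], [4.0, 2.0, 0.1], target=1, epoch=50)
print(early < late)  # True
```

When the student already ranks the target class first, its swapped output equals its own output, so the student-derived term is zero and only the swapped teacher contributes.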

🔎 Similar Papers

BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation