Kendall's Ο„ Coefficient for Logits Distillation

πŸ“… 2024-09-26
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In knowledge distillation, the KL divergence loss suffers from gradient magnitudes proportional to teacher logits, leading to insufficient updates for low-probability classes and weakened inter-class relationship modeling. To address this, we propose Rank-Kendall Knowledge Distillation (RKKD), the first method to incorporate a differentiable Kendall’s Ο„ coefficient into the distillation objective. RKKD replaces absolute logit value matching with relative ranking consistency among logit channels, establishing a temperature-free rank-order constraint. This formulation avoids optimization direction bias in soft-label matching and preserves discriminative information from small-magnitude logits, explicitly maintaining fine-grained inter-class ordinal relationships. Extensive experiments on CIFAR-100 and ImageNet demonstrate that RKKD consistently improves student accuracy across diverse teacher-student architecture pairs. Notably, it delivers stable performance gains for lightweight student models and exhibits strong generalization across datasets and model scales.

πŸ“ Abstract
Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the student model's output to match the teacher model's soft labels exactly. However, the optimization direction of the KL divergence loss is not always aligned with the task loss: a smaller KL divergence can still yield erroneous predictions that diverge from the soft labels. This misalignment often leaves the student sub-optimally trained. Moreover, even under temperature scaling, the KL divergence loss tends to focus excessively on the larger-valued channels in the logits, disregarding the rich inter-class information carried by the many smaller-valued channels. This hard constraint proves too challenging for lightweight students and hinders further distillation. To address these issues, we propose a plug-and-play ranking loss based on Kendall's $\tau$ coefficient, called Rank-Kendall Knowledge Distillation (RKKD). RKKD balances attention to smaller-valued channels by constraining the order of channel values in the student logits, supplying richer inter-class relational information, while the rank constraint on the top-valued channels helps the optimization avoid sub-optimal traps. We also discuss different differentiable forms of Kendall's $\tau$ coefficient and show that the proposed ranking loss shares a consistent optimization objective with the KL divergence. Extensive experiments on the CIFAR-100 and ImageNet datasets show that RKKD enhances a variety of knowledge distillation baselines and delivers broad improvements across multiple teacher-student architecture combinations.
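The core idea of a differentiable Kendall's Ο„ can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the `tanh` surrogate for the sign function and the `beta` sharpness parameter are assumptions for demonstration (a training implementation would use an autodiff framework such as PyTorch and batch the computation).

```python
import numpy as np

def soft_kendall_tau(student_logits, teacher_logits, beta=1.0):
    """Smooth surrogate of Kendall's tau between two logit vectors.

    tanh(beta * x) stands in for sign(x), keeping the pairwise rank
    comparisons differentiable; beta controls how sharp the surrogate is.
    (Illustrative sketch, not the paper's exact formulation.)
    """
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    # Pairwise channel differences: diff[i, j] = logit_i - logit_j
    s_diff = s[:, None] - s[None, :]
    t_diff = t[:, None] - t[None, :]
    # Concordant pairs contribute ~+1, discordant pairs ~-1
    concord = np.tanh(beta * s_diff) * np.tanh(beta * t_diff)
    c = s.shape[0]
    # Average over the c*(c-1) ordered off-diagonal pairs -> tau in [-1, 1]
    return concord.sum() / (c * (c - 1))

def rank_loss(student_logits, teacher_logits, beta=1.0):
    # Maximize rank agreement, i.e. minimize 1 - tau
    return 1.0 - soft_kendall_tau(student_logits, teacher_logits, beta)
```

With a large `beta` the surrogate approaches the exact Ο„: identical channel orderings give a loss near 0, while a fully reversed ordering gives a loss near 2. Note the loss depends only on the order of the channels, not on logit magnitudes, which is what makes the constraint temperature-free.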
Problem

Research questions and friction points this paper is trying to address.

Optimizing KL divergence in distillation leads to sub-optimal solutions
Gradient imbalance weakens inter-class information transfer
Propose Kendall's Ο„ ranking loss to rebalance gradients
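The gradient imbalance named above can be seen directly: for a softmax student trained with KL(p_teacher || p_student), the gradient with respect to student logit i is p_student_i - p_teacher_i, so channels where both probabilities are small receive near-zero updates. A minimal NumPy sketch with made-up logits (illustrative values only, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

# Hypothetical 5-class logits: one dominant class, several small channels
teacher = np.array([8.0, 4.0, 1.0, 0.5, 0.2])
student = np.array([5.0, 3.0, 2.0, 1.0, 0.5])

p_t, p_s = softmax(teacher), softmax(student)
# Gradient of KL(p_t || p_s) w.r.t. the student logits: p_s - p_t
grad = p_s - p_t
# The dominant channel receives a large update; the low-probability
# channels barely move, so their inter-class ordering is weakly supervised.
```

A rank-based loss avoids this dependence on probability magnitude, since swapping the order of two small channels changes the ranking loss as much as swapping two large ones.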
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play Kendall's Ο„ ranking loss
Rebalances gradients for low-probability channels
Enhances logit-based distillation frameworks
Yuchen Guan (Tsinghua Shenzhen International Graduate School)
Runxi Cheng (Tsinghua University)
Kang Liu (Tsinghua Shenzhen International Graduate School)
Chun Yuan (Tsinghua Shenzhen International Graduate School)