Generalized Kullback-Leibler Divergence Loss

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two inherent limitations of the Kullback–Leibler (KL) divergence in knowledge distillation and adversarial training: its asymmetric optimization behavior and its sample-level bias. The authors first prove that the KL loss is equivalent to a decomposed form, termed the Decoupled KL (DKL) loss, consisting of a weighted mean squared error term and a cross-entropy term with soft labels. Building on this decomposition, they propose the Generalized KL (GKL) loss, which breaks the asymmetric optimization property and introduces a smoother weighting function to mitigate the convergence difficulties caused by overconfident soft labels. GKL further incorporates class-wise global statistics to suppress the bias arising from individual samples. Empirically, GKL improves both robustness and generalization: it achieves new state-of-the-art adversarial robustness on RobustBench and leading or highly competitive results across knowledge distillation benchmarks, including CIFAR-10/100, ImageNet, and CLIP-based distillation tasks.
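The decomposition above builds on the textbook relation between KL divergence, cross-entropy, and entropy. The paper's full DKL form (the weighted MSE term plus the soft-label cross-entropy term) is not spelled out in this summary, so the sketch below only illustrates the standard identity KL(p || q) = CE(p, q) − H(p) in a temperature-scaled distillation setup; all function names are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax over the last axis (numerically stabilized).
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def soft_label_cross_entropy(p, q):
    # CE(p, q) = -sum_i p_i * log(q_i), with p acting as soft labels.
    return -np.sum(p * np.log(q), axis=-1)

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i)
    return -np.sum(p * np.log(p), axis=-1)

# Toy 4-class example: fixed teacher, trainable student.
teacher_logits = np.array([2.0, 0.5, -1.0, 0.1])
student_logits = np.array([1.2, 0.8, -0.5, 0.0])

p = softmax(teacher_logits, tau=4.0)  # soft labels from the teacher
q = softmax(student_logits, tau=4.0)  # student predictions

# Identity: KL(p || q) = CE(p, q) - H(p). The teacher entropy H(p) is
# constant w.r.t. the student, so minimizing KL equals minimizing the
# soft-label cross-entropy. The paper's wMSE term is NOT reproduced here.
kl = kl_divergence(p, q)
ce = soft_label_cross_entropy(p, q)
h = entropy(p)
assert np.isclose(kl, ce - h)
```

Because H(p) drops out of the student's gradient, the cross-entropy view makes clear why only the soft-label term drives optimization, which is the part of the loss the GKL modifications target.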

📝 Abstract
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of the DKL loss, we identify two areas for improvement. First, we address the limitation of the KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property and introducing a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Second, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness through experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating its substantial practical merit. Our code is available at https://github.com/jiequancui/DKL.
Problem

Research questions and friction points this paper is trying to address.

The KL loss's asymmetric optimization property causes convergence difficulties in knowledge distillation, especially for classes with high predicted scores in soft labels
Computing KL on individual samples introduces bias that class-wise global information could correct
Existing losses leave room for improvement in adversarial robustness and distillation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proof that the KL loss decouples into a weighted MSE term plus a soft-label cross-entropy term (DKL)
Breaking KL's asymmetric optimization property, with a smoother weight function
Integration of class-wise global information to reduce sample-wise bias
The resulting Generalized Kullback-Leibler (GKL) Divergence loss