ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence

📅 2025-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In knowledge distillation, the choice of divergence governs two mode-concentration effects: hardness-concentration (focusing on modes with large errors) and confidence-concentration (focusing on modes with high student confidence). Both effects are too weak under forward KL divergence (FKLD) and too strong under reverse KL divergence (RKLD), so probability mass is allocated poorly in either case. To address this, the paper proposes ABKD, a general distillation framework based on the α-β-divergence. Its theoretical results show that the α-β-divergence interpolates smoothly between FKLD and RKLD, so the two hyperparameters tune the trade-off between the effects by reweighting the gradients that redistribute probability mass. Experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The implementation is publicly available.

📝 Abstract

Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.

Problem

Research questions and friction points this paper is trying to address.

Balancing Hardness-Concentration and Confidence-Concentration effects in Knowledge Distillation

Addressing imbalance in mode-concentration effects using α-β-divergence

Improving student model focus on target class and teacher distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses α-β-divergence for balanced knowledge transfer

Interpolates between forward and reverse KLD effects

Balances hardness and confidence concentration effects
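
The interpolation between forward and reverse KLD can be checked numerically. The sketch below is not taken from the ABKD repository. It assumes the standard α-β-divergence of Cichocki, Cruces, and Amari, with the teacher distribution as the first argument, and the paper's exact parameterization may differ. The function names `kl` and `ab_divergence` and the example distributions are hypothetical. Under this assumed form, α = 1 with β near 0 approaches FKLD, and α near 0 with β = 1 approaches RKLD.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def ab_divergence(p, q, alpha, beta):
    """Assumed standard alpha-beta divergence D_AB(p || q).

    Valid only when alpha, beta, and alpha + beta are all nonzero;
    the limiting cases are approached here with small values instead.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    s = alpha + beta
    terms = p**alpha * q**beta - (alpha / s) * p**s - (beta / s) * q**s
    return float(-np.sum(terms) / (alpha * beta))


# Hypothetical teacher and student output distributions over three classes.
teacher = np.array([0.7, 0.2, 0.1])
student = np.array([0.5, 0.3, 0.2])

eps = 1e-6  # stands in for the limit beta -> 0 or alpha -> 0
near_fkld = ab_divergence(teacher, student, alpha=1.0, beta=eps)
near_rkld = ab_divergence(teacher, student, alpha=eps, beta=1.0)

print("FKLD:", kl(teacher, student), "AB(1, eps):", near_fkld)
print("RKLD:", kl(student, teacher), "AB(eps, 1):", near_rkld)
```

Intermediate choices such as α = β = 0.5 lie between the two endpoints, which is the region the hyperparameters are meant to explore.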

🔎 Similar Papers

Revisiting Knowledge Distillation under Distribution Shift