Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of large language model distillation with the Reverse KL (RKL) divergence, which often leads to overconfident student models, reduced output diversity, and insufficient alignment on non-target classes. The study is the first to uncover an anomalous gradient mechanism in RKL, wherein gradients from non-target classes undesirably push the target logit upward. To mitigate this issue, the authors propose Diversity-aware RKL (DRKL), which decomposes the gradients to eliminate this harmful effect while preserving RKL's optimization benefits and strengthening supervision on non-target classes. Extensive experiments demonstrate that DRKL consistently outperforms Forward KL, standard RKL, and other state-of-the-art distillation methods across multiple datasets and model families, achieving superior performance while effectively maintaining output diversity.
📝 Abstract
Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
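The gradient decomposition the abstract describes can be illustrated numerically. The sketch below (not the paper's implementation; all values and helper names are hypothetical) computes forward and reverse KL between a teacher and a student distribution, then splits the reverse-KL gradient with respect to the student's target logit into its target and non-target contributions, using the standard softmax Jacobian. In the example, the student already matches the teacher on the target class, yet the non-target terms still produce a nonzero gradient on the target logit:

```python
# Illustrative sketch, not the paper's DRKL method. Decomposes the
# reverse-KL gradient w.r.t. the student's target logit z_t into the
# per-class contributions of the sum, using dq_i/dz_t = q_i*(1[i==t] - q_t).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_kl(p, q):
    # FKL: D_KL(teacher p || student q), mass-covering objective.
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(p, q):
    # RKL: D_KL(student q || teacher p), mode-seeking objective.
    return float(np.sum(q * np.log(q / p)))

def rkl_grad_decomposed(p, z, t):
    # dL/dz_t = sum_i dq_i/dz_t * log(q_i / p_i); split by source class i.
    q = softmax(z)
    log_ratio = np.log(q / p)
    dq_dzt = q * ((np.arange(len(z)) == t).astype(float) - q[t])
    contrib = dq_dzt * log_ratio
    target_part = contrib[t]
    nontarget_part = contrib.sum() - contrib[t]
    return target_part, nontarget_part

# Hypothetical 4-class example: student matches the teacher exactly on the
# target class (index 0) but misallocates mass on the tail.
p = np.array([0.6, 0.25, 0.10, 0.05])          # teacher probabilities
z = np.log(np.array([0.6, 0.30, 0.08, 0.02]))  # student logits
t = 0

tgt, non = rkl_grad_decomposed(p, z, t)

# Sanity check against a finite-difference gradient of the full RKL loss.
eps = 1e-6
z_plus, z_minus = z.copy(), z.copy()
z_plus[t] += eps
z_minus[t] -= eps
fd = (reverse_kl(p, softmax(z_plus)) - reverse_kl(p, softmax(z_minus))) / (2 * eps)

print(tgt + non, fd)  # analytic and numeric gradients should agree
print(tgt, non)       # target part is zero here; non-target part is not
```

Because the student matches the teacher on the target class, the target term of the gradient vanishes (its log-ratio is zero), while the non-target terms contribute a negative gradient on the target logit; a gradient-descent step therefore increases the target logit, consistent with the anomalous upward push on the target logit that the abstract attributes to non-target classes.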
Problem

Research questions and friction points this paper is trying to address.

Reverse Kullback-Leibler divergence
LLM distillation
output diversity
overconfident predictions
tail alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Kullback-Leibler divergence
model distillation
output diversity
gradient decomposition
large language models
Hoang-Chau Luong
Rochester Institute of Technology, Rochester, NY, USA
Dat Ba Tran
Rowan University, Glassboro, NJ, USA
Lingwei Chen
Rochester Institute of Technology
Trustworthy Machine Learning · Security · Machine Learning