Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in conventional language model distillation: the Kullback–Leibler (KL) divergence objective is often dominated by high-probability tokens, neglecting informative low-probability tail tokens. To mitigate this, the authors propose a tail-aware distillation method that explicitly decouples the teacher distribution into a top-K head component and a low-probability tail component. They design a refined KL loss that amplifies the contribution of tail signals while preserving the computational efficiency of the standard KL objective. The approach applies to both pretraining and supervised distillation settings, matching or surpassing existing methods across multiple benchmarks. Notably, it enables efficient large-scale distillation using only academic-grade computational resources.

📝 Abstract
The core learning signal used in language model distillation is the standard Kullback–Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
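To make the idea concrete, here is a minimal sketch of one way a decoupled top-K KL loss could look: split the teacher distribution into its top-K head and the remaining tail, compute each group's KL contribution separately, and reweight the tail term. The function name, the K value, and the `tail_weight` reweighting are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def decoupled_topk_kl(p_teacher, log_q_student, k=2, tail_weight=2.0):
    """Illustrative decoupled KL divergence (hypothetical formulation).

    Splits the teacher distribution into its top-K head and the remaining
    tail, then amplifies the tail's KL contribution by `tail_weight`.
    With tail_weight=1.0 this reduces to the standard KL divergence.
    """
    # Indices of the teacher's top-K modes (head) vs. the rest (tail).
    order = np.argsort(p_teacher)[::-1]
    head, tail = order[:k], order[k:]

    # Per-token KL contribution: p * (log p - log q).
    contrib = p_teacher * (
        np.log(np.clip(p_teacher, 1e-12, None)) - log_q_student
    )
    head_kl = contrib[head].sum()
    tail_kl = contrib[tail].sum()

    # Head term unchanged; tail term amplified to counteract mode dominance.
    return head_kl + tail_weight * tail_kl

# Example: a peaked teacher vs. a flatter student distribution.
p = np.array([0.60, 0.30, 0.08, 0.02])   # teacher probabilities
q = np.array([0.50, 0.30, 0.15, 0.05])   # student probabilities
loss = decoupled_topk_kl(p, np.log(q), k=2, tail_weight=2.0)
```

With `tail_weight=1.0` the sum of the head and tail terms recovers the ordinary KL divergence, so the extra weight on the tail is the only departure from the standard objective, which is consistent with the abstract's claim of an unchanged computational profile.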
Problem

Research questions and friction points this paper is trying to address.

language model distillation
KL divergence
tail distribution
top-K probabilities
teacher-student learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

tail-aware distillation
decoupled KL divergence
language model distillation
top-K decoupling
efficient distillation