Rethinking Selective Knowledge Distillation

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic understanding of selective supervision strategies for knowledge distillation of large language models, where existing approaches often struggle to balance efficiency and performance. The study is the first to decouple and comparatively analyze selection mechanisms across three dimensions (position, category, and sample) and proposes SE-KD, a position-aware distillation strategy based on student-model entropy. This approach is further extended into SE-KD 3X, a multidimensional collaborative framework. By integrating offline teacher caching, the method consistently improves accuracy and task consistency across multiple benchmarks while substantially reducing computational overhead: 70% less runtime, 18% lower peak memory usage, and 80% less storage, all without sacrificing performance.

📝 Abstract
Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
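The core idea in the abstract, scoring token positions by the entropy of the student's own predictive distribution and distilling only at the highest-entropy positions, can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the `keep_ratio` parameter, the top-k selection policy, and the forward-KL loss are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_positions_by_student_entropy(student_logits, keep_ratio=0.5):
    """Rank token positions by the entropy of the student's predictive
    distribution and keep the highest-entropy fraction for distillation.
    student_logits: (seq_len, vocab_size) array of pre-softmax scores."""
    probs = softmax(student_logits)                       # (seq_len, vocab)
    entropy = -(probs * np.log(probs + 1e-12)).sum(-1)    # (seq_len,)
    k = max(1, int(keep_ratio * entropy.shape[0]))
    return np.argsort(entropy)[::-1][:k]                  # kept position indices

def selective_kd_loss(student_logits, teacher_logits, keep_ratio=0.5):
    """KL(teacher || student), averaged over the selected positions only
    instead of over every token position (dense distillation)."""
    idx = select_positions_by_student_entropy(student_logits, keep_ratio)
    p_t = softmax(teacher_logits[idx])
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits[idx]) + 1e-12)
    return float((p_t * (log_p_t - log_p_s)).sum(-1).mean())
```

In practice the selection would run inside the training loop on the student's logits for each batch, so the supervision signal adapts as the student becomes confident on easy positions; the class- and sample-axis extensions (SE-KD 3X) would apply analogous filters to the vocabulary and the training examples.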
Problem

Research questions and friction points this paper is trying to address.

knowledge distillation
selective distillation
large language models
importance signals
selection policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective knowledge distillation
student-entropy-guided
position selection
efficiency optimization
large language models
Almog Tavor
Blavatnik School of Computer Science and AI, Tel Aviv University
Itay Ebenspanger
Blavatnik School of Computer Science and AI, Tel Aviv University
Neil Cnaan
Blavatnik School of Computer Science and AI, Tel Aviv University
Mor Geva
Tel Aviv University, Google Research
Natural Language Processing