Generative Distribution Distillation

📅 2025-07-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses two key challenges in knowledge distillation (KD): the difficulty of high-dimensional optimization and the lack of semantic supervision from labels. It formulates KD as a conditional generative problem and proposes the Generative Distribution Distillation (GenDD) framework. Methodologically, GenDD introduces three core components: (1) a Split Tokenization strategy that enables stable unsupervised distillation; (2) Distribution Contraction, a theoretically grounded technique shown to act as a gradient-level surrogate for multi-task learning, implicitly injecting label supervision without an explicit classification loss; and (3) efficient supervised training on multi-step sampling representations. In the unsupervised setting on ImageNet, GenDD surpasses the KL-divergence baseline by 16.29% in top-1 accuracy. With label supervision, a ResNet-50 distilled via GenDD achieves 82.28% top-1 accuracy on ImageNet, the highest reported among comparable KD methods.
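The summary does not spell out how Distribution Contraction works internally; one plausible reading is that the teacher's representation for each sample is pulled toward a label-dependent anchor (e.g., its class mean), so the reconstruction target itself carries label information. The sketch below is a hypothetical illustration under that assumption; `contract_targets` and `alpha` are invented names, not the paper's API.

```python
import numpy as np

def contract_targets(teacher_feats, labels, num_classes, alpha=0.5):
    """Hypothetical sketch of distribution contraction: interpolate each
    teacher feature toward its class mean, so the reconstruction target
    implicitly encodes label supervision (no explicit classification loss)."""
    dim = teacher_feats.shape[1]
    class_means = np.zeros((num_classes, dim))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            class_means[c] = teacher_feats[mask].mean(axis=0)
    # alpha controls contraction strength: 0 keeps the plain teacher target,
    # 1 collapses every sample onto its class mean.
    return (1 - alpha) * teacher_feats + alpha * class_means[labels]
```

With `alpha=0` this reduces to ordinary feature distillation targets, which makes the claimed equivalence to a multi-task (distillation + classification) gradient at least intuitively plausible: the label enters only through the contracted target.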

📝 Abstract
In this paper, we formulate knowledge distillation (KD) as a conditional generative problem and propose the Generative Distribution Distillation (GenDD) framework. A naive GenDD baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a Split Tokenization strategy, achieving stable and effective unsupervised KD. Additionally, we develop the Distribution Contraction technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that GenDD with Distribution Contraction serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that GenDD performs competitively in the unsupervised setting, significantly surpassing the KL baseline by 16.29% on the ImageNet validation set. With label supervision, our ResNet-50 achieves 82.28% top-1 accuracy on ImageNet in 600 epochs of training, establishing a new state-of-the-art.
Problem

Research questions and friction points this paper is trying to address.

Tackles the curse of high-dimensional optimization in knowledge distillation
Addresses the lack of semantic supervision from labels during distillation
Aims to improve both unsupervised and supervised distillation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates KD as a conditional generative problem
Introduces Split Tokenization for stable unsupervised KD
Develops Distribution Contraction to inject label supervision
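Split Tokenization is described only at a high level here; the likely intuition is that a single high-dimensional reconstruction target is split into several lower-dimensional tokens, so each sub-problem sidesteps the curse of dimensionality. The sketch below illustrates that idea under stated assumptions: the function names are invented, and the plain per-token MSE is a simplified stand-in for GenDD's actual generative, multi-step sampling objective.

```python
import numpy as np

def split_tokenize(feats, num_tokens):
    # Hypothetical: split each high-dimensional feature vector into
    # num_tokens lower-dimensional tokens, easing optimization.
    batch, dim = feats.shape
    assert dim % num_tokens == 0, "feature dim must divide evenly into tokens"
    return feats.reshape(batch, num_tokens, dim // num_tokens)

def reconstruction_loss(student_feats, teacher_feats, num_tokens=8):
    # Simplified stand-in for the generative objective: a per-token
    # MSE between student and teacher token sequences.
    s = split_tokenize(student_feats, num_tokens)
    t = split_tokenize(teacher_feats, num_tokens)
    return float(((s - t) ** 2).mean())
```

In the paper's formulation the student generates the teacher's representation conditioned on the input; this sketch only conveys the token-splitting structure, not the sampling procedure.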