Learnability-Guided Diffusion for Dataset Distillation

πŸ“… 2026-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing dataset distillation methods suffer from information redundancy in synthetic samples, leading to overlapping training signals and reduced efficiency. This work proposes a learnability-guided curriculum distillation strategy that incrementally generates synthetic data in stages via a diffusion model, dynamically constructing complementary samples tailored to the current model's learning capacity. By incorporating constraints from a reference model, the approach balances training utility with sample validity. The method reduces redundancy by 39.1% and enables stage-specific specialization of distilled samples, achieving state-of-the-art performance with top-1 accuracies of 60.1%, 87.2%, and 72.9% on ImageNet-1K, ImageNette, and ImageWoof, respectively.
πŸ“ Abstract
Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.
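The abstract describes scoring candidate samples by balancing training utility for the current model against validity under a reference model. The sketch below is one illustrative way to picture such a learnability score, assuming a simple confidence-gap formulation (the function name, the gap definition, and the toy logits are assumptions, not the paper's actual guidance term):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def learnability_score(cur_logits, ref_logits, labels):
    """Hypothetical learnability score: reference validity minus current mastery.

    A high score means the reference model is confident the sample is a
    valid instance of its class, while the current model has not learned
    it yet -- i.e., the sample still carries non-redundant training signal.
    """
    idx = np.arange(len(labels))
    cur_p = softmax(cur_logits)[idx, labels]  # current model's confidence
    ref_p = softmax(ref_logits)[idx, labels]  # reference model's confidence
    return ref_p - cur_p

# Toy example: 4 candidate synthetic samples, 3 classes.
rng = np.random.default_rng(0)
cur = rng.normal(size=(4, 3))   # stand-in for current-model logits
ref = rng.normal(size=(4, 3))   # stand-in for reference-model logits
labels = np.array([0, 1, 2, 0])

scores = learnability_score(cur, ref, labels)
# Keep the two candidates the current model can learn the most from.
keep = np.argsort(scores)[::-1][:2]
```

In an actual stage-wise pipeline, a signal like this would steer the diffusion model's generation (or filter its outputs) at each stage, so that new samples complement what the current model already knows rather than repeating it.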
Problem

Research questions and friction points this paper is trying to address.

dataset distillation
redundancy
training signals
synthetic dataset
learnability
Innovation

Methods, ideas, or system contributions that make the work stand out.

dataset distillation
diffusion models
learnability-guided
curriculum learning
training redundancy
Jeffrey A. Chan-Santiago
Institute of Artificial Intelligence, University of Central Florida
Mubarak Shah
Trustee Chair Professor of Computer Science, University of Central Florida
Computer Vision