🤖 AI Summary
Existing dataset distillation methods generalize poorly on small-scale datasets and incur prohibitive computational overhead on large-scale ones, limiting both training efficiency and practical deployment. This paper proposes Elucidate Dataset Condensation (EDC), a unified framework that, for the first time, systematically models and optimizes the entire distillation design space. Key contributions include: (i) a soft class-aware matching mechanism that enhances semantic consistency between synthetic and real data; (ii) gradient-matching-based meta-optimization coupled with theory-guided architecture selection; and (iii) a dynamic, self-adaptive learning-rate scheduling strategy. On ImageNet-1k, EDC achieves 48.6% Top-1 accuracy with ResNet-18 using only 10 synthetic images per class (IPC), corresponding to a compression ratio of 0.78%. It significantly outperforms state-of-the-art approaches such as SRe2L while preserving diversity, fidelity, and training efficiency.
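To make the "soft class-aware matching" idea concrete, here is a minimal, hypothetical sketch of what a softened per-class matching objective might look like: instead of forcing the teacher's predictions on synthetic images to match hard one-hot labels, the targets are smoothed so that semantically related classes also receive gradient signal. The function name, the label-smoothing formulation, and the `smoothing` parameter are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over class logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_class_aware_matching_loss(teacher_logits, labels, num_classes, smoothing=0.1):
    """Cross-entropy between label-smoothed ("soft") class targets and the
    teacher's predictions on synthetic images. Softening the targets keeps
    some probability mass on non-target classes, rather than demanding a
    hard one-to-one class match (an illustrative stand-in for the paper's
    soft class-aware matching)."""
    probs = softmax(teacher_logits)
    # Distribute `smoothing` mass uniformly over the non-target classes.
    targets = np.full((len(labels), num_classes), smoothing / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - smoothing
    return -(targets * np.log(probs + 1e-12)).sum(axis=1).mean()
```

Under this sketch, synthetic images whose teacher predictions agree with their assigned class incur a lower loss than mismatched ones, while the smoothing term tolerates ambiguity between visually similar classes.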
📝 Abstract
Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version while maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous dataset condensation methods have faced challenges: some incur high computational costs that limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to suboptimal design spaces, which hinders potential improvements, especially on smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies such as soft category-aware matching and an adjusted learning-rate schedule, grounded in both empirical evidence and theoretical analysis. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small- and large-scale dataset condensation. In our experiments, EDC achieves state-of-the-art accuracy of 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, corresponding to a compression ratio of 0.78%. This performance exceeds that of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.
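As a quick sanity check, the 0.78% compression ratio quoted above follows directly from the standard ImageNet-1k training-set size (1,281,167 images across 1,000 classes) and the 10 images-per-class budget:

```python
# Reproducing the ~0.78% compression ratio from the dataset sizes.
ipc = 10                  # synthetic images per class (IPC)
num_classes = 1000        # ImageNet-1k
train_images = 1_281_167  # ImageNet-1k training-set size

compression_ratio = (ipc * num_classes) / train_images
print(f"{compression_ratio:.2%}")  # → 0.78%
```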