Dataset Distillation as Pushforward Optimal Quantization

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses efficient dataset distillation: approximating the original data distribution with a small synthetic dataset to drastically reduce model training costs. The authors establish the first theoretical link between dataset distillation, classical optimal quantization, and Wasserstein barycenters, reformulating disentangled distillation as a push-forward optimal quantization problem. The push-forward measure approximation is realized through an encoder-decoder architecture, and consistency of distilled datasets is shown for diffusion-based generative priors. Unlike conventional bi-level optimization approaches, the resulting method is computationally efficient and highly scalable. A simple extension of the state-of-the-art method D4M achieves improved performance on ImageNet-1K with trivial additional computation, and state-of-the-art results in higher image-per-class settings.

📝 Abstract
Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves performance similar to training on real data, with orders of magnitude lower computational requirements. Existing methods can be broadly categorized either as bi-level optimization problems that have neural-network training heuristics as the lower-level problem, or as disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter methods have the major advantages of speed and scalability with respect to the sizes of both the training and distilled datasets. We demonstrate that, when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, in which a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose a simple extension of the state-of-the-art dataset distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation, and state-of-the-art performance in higher image-per-class settings.
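The "optimal quantization" the abstract refers to — choosing a finite set of points that minimizes the expected projection distance to a probability measure — is, for squared Euclidean distance on an empirical measure, solved in practice by Lloyd's algorithm (k-means), where each codebook point is iteratively moved to the barycenter of its cell. A minimal NumPy sketch of this idea (illustrative only; this is not the paper's D4M pipeline, which operates in an encoder's latent space):

```python
import numpy as np

def optimal_quantization(samples, k, iters=50, seed=0):
    """Lloyd's algorithm: find k points minimizing the expected
    squared projection distance to the empirical measure."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook with k distinct samples.
    codebook = samples[rng.choice(len(samples), k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest codebook point.
        d2 = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # Move each codebook point to the barycenter (mean) of its cell.
        for j in range(k):
            cell = samples[assign == j]
            if len(cell):
                codebook[j] = cell.mean(0)
    return codebook

def quantization_error(samples, codebook):
    """Expected squared distance from a sample to its projection."""
    d2 = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean()

rng = np.random.default_rng(1)
samples = rng.normal(size=(2000, 2))
codebook = optimal_quantization(samples, k=10)
# The optimized codebook beats an arbitrary 10-point subset.
print(quantization_error(samples, codebook)
      < quantization_error(samples, samples[:10]))
```

In the paper's framing, the distilled dataset plays the role of the codebook, and the barycenter update corresponds to the Wasserstein-barycenter step, carried out in latent space and pushed forward through a decoder.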
Problem

Research questions and friction points this paper is trying to address.

Efficient Dataset
Computational Cost
Generative Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset Distillation
Generative Models
Optimal Point Representation