A Discrepancy-Based Perspective on Dataset Condensation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses dataset condensation (DC), formulating it as a probabilistic approximation problem between the source and synthetic distributions via a unified discrepancy-based framework. Unlike conventional DC approaches that solely optimize generalization performance, this work is the first to incorporate discrepancy theory into DC, enabling joint optimization of multiple objectives—including generalization, robustness, and privacy preservation. We design a differentiable distribution distance metric and integrate it with gradient-based optimization of synthetic samples, yielding highly compact synthetic sets ($M \ll N$). Experiments demonstrate that models trained from scratch on the condensed datasets achieve performance on par with or surpassing that of models trained on the full original datasets across all evaluated metrics—despite drastic reductions in data volume. These results validate the framework’s effectiveness, versatility, and practical utility for efficient and multifaceted dataset distillation.
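The discrepancy-minimization idea can be illustrated with a small self-contained sketch (not the paper's actual method): take the squared maximum mean discrepancy (MMD) with an RBF kernel as the differentiable distribution distance, and run gradient descent on the synthetic points so their empirical distribution approximates the source. All names (`rbf_kernel`, `mmd2`, `condense`) and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(S, T, sigma=1.0):
    # Biased squared MMD: a differentiable discrepancy between the
    # empirical distributions of S (synthetic) and T (source).
    return (rbf_kernel(S, S, sigma).mean()
            - 2 * rbf_kernel(S, T, sigma).mean()
            + rbf_kernel(T, T, sigma).mean())

def mmd2_grad(S, T, sigma=1.0):
    # Analytic gradient of mmd2 with respect to the synthetic points S.
    M, N = len(S), len(T)
    K_SS = rbf_kernel(S, S, sigma)             # (M, M)
    K_ST = rbf_kernel(S, T, sigma)             # (M, N)
    diff_SS = S[:, None, :] - S[None, :, :]    # (M, M, d)
    diff_ST = S[:, None, :] - T[None, :, :]    # (M, N, d)
    return ((-2.0 / (M * M * sigma ** 2)) * (K_SS[..., None] * diff_SS).sum(1)
            + (2.0 / (M * N * sigma ** 2)) * (K_ST[..., None] * diff_ST).sum(1))

def condense(T, M=10, steps=500, lr=0.5, sigma=1.0, seed=0):
    # Initialize S from a random subset of T, then gradient-descend the
    # discrepancy so the M synthetic points approximate the source set.
    rng = np.random.default_rng(seed)
    S = T[rng.choice(len(T), M, replace=False)].copy()
    S += 0.01 * rng.standard_normal(S.shape)
    for _ in range(steps):
        S -= lr * mmd2_grad(S, T, sigma)
    return S

# Toy run: condense 1000 source points in 2-D down to 10 synthetic points
# and compare against a naive random subset of the same size.
rng = np.random.default_rng(42)
T = rng.standard_normal((1000, 2)) + np.array([2.0, -1.0])
S0 = T[rng.choice(len(T), 10, replace=False)]      # random-subset baseline
S = condense(T, M=10)
print(f"random subset MMD^2: {mmd2(S0, T):.4f}")
print(f"condensed set MMD^2: {mmd2(S, T):.4f}")
```

The optimized synthetic set typically attains a markedly smaller discrepancy than a random subset of equal size, which is the sense in which $\mathcal{S}$ "approximates" the distribution of $\mathcal{T}$; the actual framework generalizes this by swapping in discrepancies tailored to other objectives such as robustness or privacy.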

📝 Abstract
Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i=1}^{N}$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j=1}^{M}$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
Problem

Research questions and friction points this paper is trying to address.

Constructing smaller synthetic datasets matching original performance
Extending dataset condensation beyond generalization to robustness
Unifying condensation methods via formal discrepancy-based framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrepancy-based framework for dataset condensation
Unified approach encompassing existing DC methods
Extends objectives to robustness and privacy