Grounding and Enhancing Informativeness and Utility in Dataset Distillation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of preserving critical information from the original dataset while enhancing the training efficacy of the synthetic data in dataset distillation. To this end, the authors propose InfoUtil, a novel method that, for the first time, formalizes sample informativeness and utility as core distillation criteria. By integrating Shapley values with gradient norms, InfoUtil constructs a theoretically grounded optimization objective for synthesizing high-quality, compact training sets. Evaluated on ImageNet-1K with ResNet-18 as the benchmark architecture, InfoUtil achieves a 6.1% performance improvement over the current state of the art, substantially boosting the training effectiveness of distilled data.
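The summary's second component, selecting globally influential samples by gradient norm, can be illustrated with a small generic sketch. This is not the paper's exact criterion: it uses a hypothetical logistic-regression setup where the per-sample gradient of the loss has the closed form (sigmoid(w·x) − y)·x, so its norm is cheap to compute, and it ranks samples by that norm.

```python
import numpy as np

def per_sample_grad_norms(X, y, w):
    """L2 norm of each sample's logistic-loss gradient w.r.t. weights w.

    For logistic regression the per-sample gradient is
    (sigmoid(w . x_i) - y_i) * x_i, so its norm factorizes as
    |sigmoid(w . x_i) - y_i| * ||x_i||.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
    return np.abs(p - y) * np.linalg.norm(X, axis=1)

def select_top_k(X, y, w, k):
    """Indices of the k samples with the largest gradient norm."""
    norms = per_sample_grad_norms(X, y, w)
    return np.argsort(norms)[::-1][:k]

# Toy data: 100 samples, 5 features, random labels and weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.random(100) > 0.5).astype(float)
w = rng.normal(size=5)
idx = select_top_k(X, y, w, 10)
```

In practice the same ranking idea applies to deep networks by taking per-sample gradient norms of the training loss, but the selection rule here is only a stand-in for whatever utility criterion InfoUtil actually optimizes.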

📝 Abstract
Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset using ResNet-18.
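The abstract's first component, Shapley Value attribution, is commonly approximated by Monte Carlo sampling over permutations: the Shapley value of a player is its marginal contribution averaged over all orderings. The sketch below is a generic estimator with a hypothetical toy value function, not the attribution scheme InfoUtil actually uses on image samples.

```python
import random

def shapley_monte_carlo(players, value_fn, n_samples=200, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over random permutations of the player set."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        perm = players[:]
        rng.shuffle(perm)
        coalition = []
        prev = value_fn(coalition)        # value of the empty coalition
        for p in perm:
            coalition.append(p)
            cur = value_fn(coalition)
            phi[p] += cur - prev          # marginal contribution of p
            prev = cur
    return {p: v / n_samples for p, v in phi.items()}

# Toy additive game: each "feature" contributes a fixed weight, so the
# Shapley value of feature i is exactly weights[i].
weights = {"a": 1.0, "b": 2.0, "c": -0.5}
phi = shapley_monte_carlo(list(weights),
                          lambda S: sum(weights[p] for p in S))
```

For an additive game like this toy example the estimate is exact; for real value functions (e.g. model accuracy as a function of which inputs are present) the sampling error shrinks with `n_samples`.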
Problem

Research questions and friction points this paper is trying to address.

Dataset Distillation
Informativeness
Utility
Knowledge Distillation
Synthetic Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset Distillation
Informativeness
Utility
Shapley Value
Gradient Norm