🤖 AI Summary
Current dataset distillation (DD) evaluation relies heavily on test accuracy, making it susceptible to confounding factors such as soft labels and strong data augmentations; as a result, it fails to reflect the intrinsic information quality of distilled images and compromises evaluation fidelity and reproducibility. To address this, we propose DD-Ranking, the first decoupled, robust, and unified evaluation framework for DD. It isolates external technical influences via rank-consistency analysis, cross-model generalization testing, and ablation-driven benchmarking, focusing instead on the genuine information gain encoded in distilled data. In addition, a zero-shot transfer evaluation is introduced to assess the fundamental representational capacity of distilled samples. Experiments on CIFAR and ImageNet-1K reveal that the performance gains reported for several state-of-the-art methods stem from "evaluation contamination" rather than true distillation efficacy. DD-Ranking substantially improves method discriminability, evaluation fairness, and result reproducibility.
📝 Abstract
In recent years, dataset distillation (DD) has provided a reliable solution for data compression: models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: is accuracy still a reliable metric for fairly evaluating dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from these additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results under the same evaluation tricks. Such misaligned evaluation settings severely hinder the development of DD. We therefore propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics, to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research.
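The core diagnostic idea here, checking whether method rankings survive once soft labels and strong augmentations are stripped away, can be sketched with a simple rank-consistency computation. The sketch below is illustrative only: the accuracy numbers are hypothetical (not results from the paper), and the Spearman correlation is one plausible way to quantify rank consistency, not necessarily the exact metric DD-Ranking uses.

```python
# Hedged sketch: rank consistency of DD methods across two evaluation
# protocols. All accuracy values are HYPOTHETICAL, for illustration only.

def rank(values):
    """Return the rank of each value (0 = highest accuracy)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation coefficient (assumes no ties)."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical test accuracies (%) for four DD methods under:
#   (a) the full bag of tricks: soft labels + strong augmentation
#   (b) a stripped protocol: hard labels, standard augmentation
full_tricks = [62.1, 60.5, 58.9, 55.0]
stripped    = [48.3, 51.2, 50.1, 49.5]

rho = spearman(full_tricks, stripped)
print(f"rank consistency (Spearman rho): {rho:.2f}")
```

A rho near 1 would mean the reported ordering reflects the distilled data itself; a low or negative value, as in this made-up example, would indicate that the ordering is driven by the evaluation tricks rather than genuine information gain.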