Label-Consistent Dataset Distillation with Detector-Guided Refinement

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dataset distillation (DD) often suffers from label inconsistency and loss of image detail in synthetic samples, degrading downstream performance. To address this, we propose a detector-guided diffusion distillation framework: first, a pretrained detector identifies low-confidence or mislabeled synthetic samples; second, a pretrained diffusion model, conditioned on the corresponding image prototype and label, generates high-fidelity candidate replacements; finally, a confidence-weighted, diversity-aware selection mechanism builds a compact, robust surrogate dataset. The method markedly improves image fidelity and label accuracy in extremely small-scale settings (e.g., 10–50 samples per class), achieving state-of-the-art downstream generalization on benchmarks including CIFAR and ImageNet, while reducing storage overhead and training cost relative to prior approaches.

📝 Abstract
Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
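The abstract's first refinement step, identifying synthetic images whose predicted class mismatches the assigned label or whose confidence is too low, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector interface (returning a predicted label and a confidence score) and the confidence threshold are assumptions.

```python
def find_anomalies(detector, images, labels, conf_threshold=0.5):
    """Return indices of synthetic samples flagged as anomalous.

    A sample is anomalous if the detector's top prediction disagrees
    with its assigned label, or if the detector's confidence falls
    below `conf_threshold` (a hypothetical cutoff; the paper does not
    specify a value).
    """
    anomalous = []
    for i, (img, label) in enumerate(zip(images, labels)):
        pred, conf = detector(img)  # assumed interface: (predicted_label, confidence)
        if pred != label or conf < conf_threshold:
            anomalous.append(i)
    return anomalous
```

Each flagged index would then be passed to the diffusion model to generate replacement candidates conditioned on the class prototype and label.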
Problem

Research questions and friction points this paper is trying to address.

Improve label consistency in distilled datasets
Enhance structural detail in synthetic samples
Optimize dataset distillation with detector guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detector-guided refinement for label consistency
Diffusion model generates multiple candidate images
Joint selection based on confidence and diversity
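The joint selection step above can be sketched as a simple scoring rule: reward the detector's confidence, penalize similarity to already-accepted synthetic samples, and keep the best-scoring candidate. This is a hedged sketch, not the paper's formula; the cosine-similarity measure and the trade-off weight `lam` are assumptions introduced for illustration.

```python
import numpy as np

def select_candidate(candidate_confidences, candidate_features, qualified_features, lam=0.5):
    """Pick the candidate balancing detector confidence and intra-class diversity.

    candidate_confidences: length-N detector confidences, one per candidate
    candidate_features:    (N, D) feature embeddings of the candidates
    qualified_features:    (M, D) embeddings of already-accepted samples
    lam:                   hypothetical trade-off weight (not specified in the paper)
    """
    conf = np.asarray(candidate_confidences, dtype=float)
    cands = np.asarray(candidate_features, dtype=float)
    if len(qualified_features) == 0:
        # nothing accepted yet: diversity term is moot, pick by confidence
        return int(np.argmax(conf))
    qual = np.asarray(qualified_features, dtype=float)
    # cosine similarity between every candidate and every qualified sample
    cn = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    qn = qual / np.linalg.norm(qual, axis=1, keepdims=True)
    max_sim = (cn @ qn.T).max(axis=1)  # each candidate's closest accepted sample
    score = conf - lam * max_sim       # high confidence, low redundancy
    return int(np.argmax(score))
```

A candidate nearly identical to an accepted sample is penalized even if the detector is confident in it, which is what keeps the distilled set diverse within each class.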