Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion-based data augmentation methods suffer from insufficient diversity guarantees and require scaling the dataset by 10–30× to improve in-distribution performance. To address this, we propose *learning-lag-guided subset augmentation*: selectively applying diffusion-based synthesis only to the subset of samples (30%–40%) that remain poorly classified early in training—i.e., those exhibiting high learning lag. Theoretical analysis—based on a two-layer CNN—and extensive experiments jointly demonstrate that this strategy homogenizes feature learning speeds across samples and mitigates noise amplification, outperforming full-dataset augmentation. Our method is orthogonal to optimization algorithms and compatible with SGD, SAM, and conventional augmentations. On CIFAR-10, CIFAR-100, and TinyImageNet, it yields accuracy gains of up to 2.8%; notably, SGD augmented with our method surpasses SAM on CIFAR-100 and TinyImageNet. Robust improvements are observed across diverse model architectures and optimizers.

📝 Abstract
Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts the performance by up to 2.8% in a variety of scenarios, including training ResNet, ViT and DenseNet on CIFAR-10, CIFAR-100, and TinyImageNet, with a range of optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
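The core selection rule described above—augment only the samples that are not learned early in training—can be sketched as follows. This is a hypothetical illustration of the idea, not the authors' implementation: `select_learning_lag_subset` and its per-sample early-training loss input are assumptions, and a real pipeline would record these losses after a few warm-up epochs before passing the chosen indices to a diffusion model for synthesis.

```python
def select_learning_lag_subset(per_sample_loss, frac=0.35):
    """Pick the `frac` fraction of samples with the highest early-training
    loss -- a proxy for 'learning lag' (hypothetical sketch of the paper's
    selection rule, not the authors' code)."""
    n = len(per_sample_loss)
    k = max(1, round(frac * n))  # paper augments roughly 30%-40% of the data
    # rank samples by descending loss and keep the top-k indices
    order = sorted(range(n), key=lambda i: per_sample_loss[i], reverse=True)
    return sorted(order[:k])

# toy example: losses of 10 samples recorded after a few warm-up epochs
losses = [0.1, 2.3, 0.05, 1.8, 0.2, 0.9, 2.7, 0.3, 0.15, 1.1]
print(select_learning_lag_subset(losses, frac=0.4))  # [1, 3, 6, 9]
```

Only these indices would then be sent to the diffusion model for synthetic augmentation, leaving the well-learned majority of the dataset untouched.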
Problem

Research questions and friction points this paper is trying to address.

Diffusion-based augmentation struggles to guarantee diversity of generated images
Existing methods must scale datasets by 10–30× to improve in-distribution performance
Whether augmenting a targeted subset can outperform full-dataset augmentation and SOTA optimizers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Targeted synthetic augmentation via diffusion models
Augments only the samples not learned early in training
Augmenting just 30%–40% of the data boosts accuracy by up to 2.8%
Dang Nguyen
Department of Computer Science, University of California, Los Angeles
Jiping Li
University of California, Los Angeles
Machine Learning, Statistical Learning Theory, Optimization
Jinghao Zheng
Department of Automation, Shanghai Jiao Tong University
Baharan Mirzasoleiman
UCLA
Machine Learning, Optimization, Submodularity, ML Sustainability, Data-quality