🤖 AI Summary
Synthetic image data is often of inconsistent quality and, once it far outnumbers real data, introduces domain shift that degrades model performance. Method: We propose a fine-tuning-free synthetic data augmentation framework for diffusion models. (1) Leveraging joint disambiguation of class names by LLMs and CLIP, we design contextualized and stylized prompting strategies that enhance the semantic fidelity and diversity of generated images; (2) an auxiliary batch normalization module mitigates the distributional shift between the synthetic and real domains. Contribution/Results: We provide the first empirical evidence that off-the-shelf generative models, used without any adaptation, can substantially boost recognition accuracy. Our diversity-aware generation strategy sustains consistent performance gains even when scaling ImageNet to six times its original size, achieving significant classification accuracy improvements on both ImageNet and cross-domain benchmarks and demonstrating strong generalization.
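The two diversification strategies can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's actual pipeline: the function name, prompt templates, and the idea of passing LLM-suggested context and style phrases as plain lists are all hypothetical.

```python
def build_diverse_prompts(class_name, contexts, styles):
    """Toy sketch of contextualized (CD) and stylized (SD) prompting.

    In the paper, context and style phrases are produced by an LLM for a
    disambiguated class name; here they are supplied as plain lists, and
    the templates below are illustrative guesses, not the paper's.
    """
    # CD: place the class in varied scenes/situations
    contextualized = [f"a photo of a {class_name}, {ctx}" for ctx in contexts]
    # SD: vary the rendering style of the image itself
    stylized = [f"a {style} of a {class_name}" for style in styles]
    return contextualized + stylized
```

Each resulting string would be sent to the text-to-image model, so a single class name fans out into many distinct prompts instead of one naive "a photo of a {class}" template.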
📝 Abstract
Recent advances in generative deep learning have enabled the creation of high-quality synthetic images via text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the fine-tuned model can enhance an ImageNet classifier's performance; however, performance degrades once synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether training can be scaled further with more synthetic data. We present a new framework that leverages off-the-shelf generative models to generate synthetic training images, addressing three challenges: class name ambiguity, lack of diversity in naive prompts, and domain shift. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shift, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently improves recognition performance as synthetic data grows, up to 6x the original ImageNet size, showcasing the potential of synthetic data for stronger recognition models and out-of-domain generalization.
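The auxiliary batch normalization idea, keeping separate normalization statistics for real and synthetic batches while sharing the rest of the network, can be sketched as follows. This is a minimal numpy illustration under stated assumptions (1-D features, shared affine parameters, standard running-statistics update); the class name and its interface are hypothetical, not the paper's implementation.

```python
import numpy as np

class AuxiliaryBatchNorm:
    """Toy BN layer with per-domain statistics (real vs. synthetic).

    The affine parameters (gamma, beta) are shared across domains, as in
    dual/auxiliary-BN domain adaptation schemes, while each domain keeps
    its own running mean and variance so synthetic batches do not
    contaminate the real-data statistics.
    """

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        # shared learnable affine parameters
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        # separate running statistics per domain
        self.stats = {
            "real": {"mean": np.zeros(num_features), "var": np.ones(num_features)},
            "synthetic": {"mean": np.zeros(num_features), "var": np.ones(num_features)},
        }

    def __call__(self, x, domain, training=True):
        s = self.stats[domain]
        if training:
            # normalize with batch statistics, update this domain's running stats
            mean, var = x.mean(axis=0), x.var(axis=0)
            s["mean"] = (1 - self.momentum) * s["mean"] + self.momentum * mean
            s["var"] = (1 - self.momentum) * s["var"] + self.momentum * var
        else:
            # at inference, use the running statistics of the chosen domain
            mean, var = s["mean"], s["var"]
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At test time one would normalize with the real-domain statistics only, so the classifier sees the real-data distribution it is evaluated on.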