The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

📅 2024-06-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
Despite growing reliance on generative synthetic images (e.g., from Stable Diffusion) for data augmentation in image classification, their empirical effectiveness relative to real-world alternatives remains inadequately benchmarked. Method: This work systematically evaluates generative synthetic images against real images obtained via CLIP cross-modal retrieval from LAION-2B, across multiple fine-grained classification tasks, using ViT and ResNet backbones for fine-tuning. Contribution/Results: Retrieved real images consistently match or significantly outperform their synthetic counterparts across all tasks. The underperformance of synthetic data is attributed primarily to generation artifacts and inaccurate task-relevant visual details. Crucially, the study establishes simple retrieval as an essential, empirically grounded baseline for evaluating synthetic data efficacy, challenging the prevailing reliance on generative methods. To support reproducibility, the authors release all code, data, and models, and advocate shifting synthetic-data research from a "generation-first" to a "utility-first" mindset.

📝 Abstract
Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that targeted retrieval is a critical baseline to consider when training with synthetic data -- a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise.
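The retrieval baseline described in the abstract boils down to ranking candidate images by CLIP similarity to a task-relevant text query and keeping the top matches. The sketch below illustrates that ranking step with numpy only; the embeddings are random stand-ins (in the actual pipeline they would come from a CLIP text encoder and precomputed CLIP image embeddings of LAION-2B), and the function name `retrieve_topk` is illustrative, not from the paper's released code.

```python
import numpy as np

def retrieve_topk(text_emb, image_embs, k=5):
    """Return indices of the k images most similar to a text query.

    Follows the CLIP convention: L2-normalize both sides so the dot
    product equals cosine similarity, then rank descending.
    """
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ t                    # cosine similarity per image
    return np.argsort(-sims)[:k]      # indices of top-k matches

# Hypothetical stand-ins for CLIP embeddings (dim 512, as in ViT-B/32).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 512))   # "LAION" image embeddings
text_emb = rng.normal(size=512)             # query, e.g. a class name prompt
top = retrieve_topk(text_emb, image_embs, k=16)
```

In practice the retrieved top-k images per class would then be used as the real-data fine-tuning set that the paper compares against targeted synthetic images.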
Problem

Research questions and friction points this paper is trying to address.

Generative Models
Image Recognition
Performance Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real Image Training
Synthetic Image Comparison
Model Performance Enhancement