🤖 AI Summary
Diffusion models rely heavily on curated, high-quality, in-domain training data, which limits their scalability and robustness. Method: This work proposes Ambient Diffusion Omni, a framework that extracts training signal from low-quality, synthetic, and out-of-distribution images. The key observation, established theoretically, is that diffusion noise dampens the skew between the desired high-quality distribution and the mixed distribution actually observed, so corrupted data can be used safely at sufficiently high noise levels. Exploiting two properties of natural images, power-law spectral decay and locality, the framework handles images corrupted by Gaussian blur, JPEG compression, and motion blur. Contribution/Results: The approach achieves state-of-the-art FID on ImageNet and significantly improves both fidelity and diversity in text-to-image generation. The theoretical analysis characterizes the trade-off between learning from biased data and from limited unbiased data at each diffusion time.
📝 Abstract
We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.
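The core insight, that noise dampens the skew between the high-quality target distribution and the mixed observed distribution, can be illustrated with a small numerical sketch (not from the paper; the 1-D Gaussians and noise levels below are purely illustrative assumptions). Adding Gaussian noise of standard deviation `noise` convolves each density with a Gaussian, and the total variation distance between the two noised densities shrinks as the noise level grows:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated on grid x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def tv_distance(p, q, dx):
    """Total variation distance between two densities tabulated on a uniform grid."""
    return 0.5 * np.sum(np.abs(p - q)) * dx

x = np.linspace(-40, 40, 8001)
dx = x[1] - x[0]

# Illustrative stand-ins: a "clean" target distribution and a biased/corrupted one.
mu_clean, mu_biased, sigma0 = 0.0, 1.5, 1.0

for noise in [0.0, 1.0, 3.0, 10.0]:
    # Adding independent N(0, noise^2) to samples convolves the density with a
    # Gaussian, so the resulting variance is sigma0^2 + noise^2.
    s = np.sqrt(sigma0**2 + noise**2)
    p = gaussian_pdf(x, mu_clean, s)
    q = gaussian_pdf(x, mu_biased, s)
    print(f"noise={noise:5.1f}  TV={tv_distance(p, q, dx):.4f}")
```

The printed TV distance decreases monotonically with the noise level, mirroring the paper's claim that at high diffusion times the biased data becomes nearly indistinguishable from the target and can be learned from with little penalty.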