Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets

📅 2025-02-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion models trained on massive datasets risk memorizing their training samples, which raises copyright concerns. Training only on noise-corrupted copies keeps the original content hidden from the model, but in practice the number of noisy samples needed to recover the true data distribution is prohibitive. To address this, we propose the Stochastic Forward-Backward Deconvolution (SFBD) framework, the first to integrate deconvolution theory into diffusion model training. SFBD adds a lightweight pretraining stage on a small set of clean data (e.g., 4% of CIFAR-10) to guide the deconvolution, and comes with a theoretical guarantee that the model is guided toward the underlying data distribution. Because the sensitive portion of the data is only ever seen in noisy form, the approach mitigates copyright risk. On CIFAR-10, SFBD achieves an FID of 6.31 using only 4% clean images and 3.58 with 10%, significantly outperforming noise-only baselines in the low-data, high-noise regime.

📝 Abstract

Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy versions of data with potential copyright issues, ensuring the model never observes the original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward-Backward Deconvolution (SFBD) method, we attain an FID of $6.31$ on CIFAR-10 with just $4\%$ clean images (and $3.58$ with $10\%$). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data, or alternatively on data from similar datasets. Empirical studies further support these findings and offer additional insights.
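
The deconvolution view in the abstract can be made concrete under a standard additive Gaussian noise model. This model, and the symbols $x$, $y$, $\sigma$, $\varepsilon$, and $\varphi$, are assumptions for illustration; the abstract does not spell out the corruption process. If each noisy sample is $y = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ independent of $x$, then the noisy density is the data density convolved with a Gaussian, and the characteristic functions factorize:

$$
p_Y = p_X * \mathcal{N}(0, \sigma^2 I), \qquad \varphi_Y(t) = \varphi_X(t)\, e^{-\sigma^2 \lVert t \rVert^2 / 2}.
$$

Recovering $p_X$ amounts to dividing $\varphi_Y$ by a factor that decays rapidly in $\lVert t \rVert$, so estimation error at high frequencies is strongly amplified. This is one way to read the abstract's claim that learning from noisy samples alone is feasible in theory but demands an impractical number of samples.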

Problem

Research questions and friction points this paper is trying to address.

Training diffusion models with noisy datasets

Overcoming practical data collection challenges

Enhancing deconvolution with limited clean data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Forward-Backward Deconvolution method

Pretrain model with limited clean data

Learn the true data distribution from noisy samples
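
The data regime behind the items above (a small clean subset plus a large noisy remainder) could be set up roughly as follows. This is a minimal sketch, not the paper's pipeline: the helper name `split_clean_noisy`, the Gaussian corruption, and the noise level `sigma=0.2` are all assumptions for illustration, and only the 4% clean fraction comes from the summary.

```python
import numpy as np


def split_clean_noisy(data, clean_fraction=0.04, sigma=0.2, seed=0):
    """Hypothetical helper: keep a small clean subset and corrupt the rest.

    data           -- array of shape (n_samples, ...), e.g. flattened images
    clean_fraction -- share of samples kept clean (4% as in the summary)
    sigma          -- assumed Gaussian noise std; not taken from the paper
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    perm = rng.permutation(n)
    n_clean = int(round(clean_fraction * n))

    # Small clean subset, the kind of data used for pretraining.
    clean = data[perm[:n_clean]]

    # The remaining samples are only ever exposed with additive Gaussian noise.
    rest = data[perm[n_clean:]]
    noisy = rest + sigma * rng.standard_normal(rest.shape)
    return clean, noisy


# Toy stand-in for an image dataset: 1000 samples with 8 features each.
toy_data = np.zeros((1000, 8))
clean, noisy = split_clean_noisy(toy_data)
print(clean.shape, noisy.shape)
```

In this sketch the model would see `clean` directly during pretraining, while `noisy` stands in for the copyright-sensitive portion that is only ever available in corrupted form.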

🔎 Similar Papers

Learning Diffusion Priors from Observations by Expectation Maximization