Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study

📅 2026-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the issue of distribution drift and performance degradation that arises when diffusion models are recursively trained on synthetic data. It presents the first rigorous theoretical framework analyzing the distributional shift induced by such recursive training, deriving tight upper and lower bounds on the cumulative discrepancy between the generated and target distributions. The analysis quantitatively links this discrepancy to the score estimation error and the proportion of fresh real data incorporated during training. Through score function modeling, divergence analysis, and empirical validation, the study characterizes distinct drift regimes under different training mechanisms and demonstrates the accuracy of its theoretical predictions on both synthetic and image datasets.

Technology Category

Application Category

📝 Abstract
Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.
Problem

Research questions and friction points this paper is trying to address.

error propagation
model collapse
diffusion models
synthetic data
distribution drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
error propagation
model collapse
score-based generative models
synthetic data training
🔎 Similar Papers
No similar papers found.
N
Nail B. Khelifa
Department of Engineering, University of Cambridge, Cambridge, United Kingdom
R
Richard E. Turner
Department of Engineering, University of Cambridge, Cambridge, United Kingdom
Ramji Venkataramanan
Ramji Venkataramanan
University of Cambridge
Information theoryHigh-dimensional statisticsMachine Learning