🤖 AI Summary
This work addresses the distribution drift and performance degradation that arise when diffusion models are recursively trained on synthetic data. It develops a theoretical analysis of the distributional shift induced by such recursive training in score-based diffusion models, deriving upper and lower bounds on the accumulated divergence between the generated and target distributions. The bounds quantitatively link this divergence to the score estimation error and to the proportion of fresh real data incorporated in each training round. The analysis characterizes distinct drift regimes, and experiments on synthetic data and images illustrate the theoretical predictions.
📝 Abstract
Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.
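The training pipeline described above (each round mixing synthetic samples from the previous model with fresh samples from the target) can be illustrated with a toy simulation. The sketch below is not the paper's method: it replaces the score-based diffusion model with a one-dimensional Gaussian fit, takes the target to be N(0, 1), and uses the closed-form Gaussian KL divergence as a stand-in for the divergence studied in the paper. The names `recursive_fit`, `kl_gauss`, and `fresh_fraction` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kl_gauss(mu, sigma, mu0=0.0, sigma0=1.0):
    """Closed-form KL( N(mu, sigma^2) || N(mu0, sigma0^2) )."""
    return (np.log(sigma0 / sigma)
            + (sigma**2 + (mu - mu0) ** 2) / (2 * sigma0**2)
            - 0.5)


def recursive_fit(fresh_fraction, rounds=50, n=200, seed=0):
    """Toy recursive training loop (assumption: Gaussian fit, not a diffusion model).

    Each round draws a fraction `fresh_fraction` of its n training points
    from the target N(0, 1) and the rest from the previous round's fitted
    Gaussian, then refits the mean and standard deviation.
    Returns the KL divergence to the target after each round.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 coincides with the target
    history = []
    for _ in range(rounds):
        n_fresh = int(round(fresh_fraction * n))
        fresh = rng.normal(0.0, 1.0, size=n_fresh)            # real data
        synthetic = rng.normal(mu, sigma, size=n - n_fresh)   # model samples
        data = np.concatenate([fresh, synthetic])
        mu, sigma = data.mean(), data.std()                   # "retrain" the model
        history.append(kl_gauss(mu, sigma))
    return history


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.5):
        kl = recursive_fit(frac)
        print(f"fresh_fraction={frac:.1f}  final KL={kl[-1]:.4f}")
```

Running the script for several values of `fresh_fraction` gives a rough, qualitative feel for how the share of fresh data affects accumulated drift in this simplified setting; here the finite-sample fitting error plays the role that score estimation error plays in the paper.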