🤖 AI Summary
This work investigates model collapse in generative modeling under iterative maximum likelihood estimation (MLE) training with mixed real and synthetic data. It addresses the fundamental question of whether collapse is inevitable as the proportion of real data decays to zero.
Method: We develop the first non-asymptotic theoretical framework for this setting, operating under standard MLE consistency assumptions. We analyze convergence behavior, construct explicit counterexamples, and derive precise conditions for collapse onset.
Contribution/Results: We prove that, contrary to common intuition, model collapse need not occur even when the real-data fraction vanishes, provided the MLE consistency condition holds. We further show this condition is tight by constructing an explicit counterexample in which collapse arises once it is relaxed. We establish necessary and sufficient conditions for avoiding collapse and present the first rigorously characterized iterative generative setting that exhibits rapid collapse. Our analysis unifies the statistical foundations of synthetic-data training and provides theoretical guarantees for controllable, robust iterative generative modeling.
📝 Abstract
The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about *model collapse*: a critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original dataset. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
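The feedback loop described above can be illustrated with a toy experiment. The sketch below is not the paper's construction: it fits a one-dimensional Gaussian by MLE (sample mean and biased sample standard deviation), samples the next generation's synthetic data from the fitted model, and contrasts two regimes. In the *accumulate* regime (the abstract's setting), each synthetic batch is added to a pool that retains the original real data; in the *replace* regime (a common setup in earlier model-collapse analyses), each generation trains only on its predecessor's output. All constants (`n`, `T`, the seed) are arbitrary choices for the demonstration.

```python
import math
import random

random.seed(0)
n, T = 200, 2000  # samples per generation, number of generations

# --- Accumulate: synthetic data is added to a pool that keeps the real data ---
real = [random.gauss(0.0, 1.0) for _ in range(n)]     # "real" data ~ N(0, 1)
s1, s2, cnt = sum(real), sum(x * x for x in real), n  # running moments of the pool
for _ in range(T):
    mu = s1 / cnt
    sigma = math.sqrt(max(s2 / cnt - mu * mu, 0.0))   # MLE of the std (1/N normalization)
    for _ in range(n):                                # append a synthetic batch
        x = random.gauss(mu, sigma)
        s1 += x
        s2 += x * x
    cnt += n
mu_acc = s1 / cnt
acc_sigma = math.sqrt(max(s2 / cnt - mu_acc * mu_acc, 0.0))

# --- Replace: each generation trains only on its predecessor's output ---
data = real
for _ in range(T):
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    data = [random.gauss(mu, sigma) for _ in range(n)]
mu_rep = sum(data) / n
rep_sigma = math.sqrt(sum((x - mu_rep) ** 2 for x in data) / n)

# Under accumulation the fitted std stays close to the true value 1, because
# each new batch perturbs the pool's statistics with a shrinking 1/t weight.
# Under replacement the variance follows a multiplicative random walk with a
# downward drift of order 1/n per step, so it shrinks toward 0: model collapse.
print(f"accumulate: fitted std after {T} generations = {acc_sigma:.3f}")
print(f"replace:    fitted std after {T} generations = {rep_sigma:.3f}")
```

The contrast matches the abstract's message: the same MLE procedure avoids collapse when data accumulates, while the replace regime degenerates, so the training-data dynamics, not synthetic data per se, drive the outcome.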