🤖 AI Summary
In self-consuming training loops (STLs), models iteratively train on their own synthetic data and often suffer performance degradation or catastrophic collapse; yet the underlying collapse mechanisms, and why some models are more robust than others, remain theoretically unexplained.
Method: We introduce the concept of *recursive stability* and derive a theoretical bound on STL generalization error. Leveraging Transformer architectural properties and the ratio of real to synthetic data, we conduct a rigorous generalization analysis and derive convergence conditions for in-context learning within STLs.
Contribution: We give the first proof that a constant fraction of real data suffices to guarantee convergence of Transformers under STLs, and we identify architecture design and data-mixing strategy as decisive factors for stability. We further provide a verifiable collapse criterion and principled guidelines for optimal synthetic-data scaling, establishing both a theoretical foundation and a practical framework for safe, sustainable self-iterative training.
📝 Abstract
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online is nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Empirical results, however, have been strikingly inconsistent: some models degrade or even collapse, while others avoid these failures, and no theoretical account has explained the discrepancy. This paper introduces the notion of recursive stability and presents the first theoretical generalization analysis of STLs, revealing how both model architecture and the proportion of real to synthetic data determine their success. We further extend this analysis to Transformers in in-context learning, showing that even a constant proportion of real data ensures convergence, and provide insights into optimal synthetic-data sizing.
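To build intuition for why a constant fraction of real data matters, here is a minimal toy sketch (not the paper's analysis or model): a 1-D Gaussian is repeatedly refit on samples drawn from its own previous fit, optionally mixed with a fixed fraction of real data each round. The function name `run_stl` and all parameter values are illustrative assumptions; the known qualitative behavior is that the purely self-consuming loop's fitted spread shrinks toward collapse, while mixing in real data anchors the estimate.

```python
import random
import statistics

def run_stl(real_data, real_fraction, n_rounds, n_samples, seed=0):
    """Toy self-consuming training loop: repeatedly fit a 1-D Gaussian,
    then refit it on synthetic samples drawn from the current fit,
    mixed with a constant fraction of real data each round."""
    rng = random.Random(seed)
    mu = statistics.mean(real_data)
    sigma = statistics.stdev(real_data)
    n_real = int(real_fraction * n_samples)  # constant per-round real-data budget
    for _ in range(n_rounds):
        # Synthetic samples come from the model fitted in the previous round.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        mixed = synthetic + rng.sample(real_data, n_real)
        mu = statistics.mean(mixed)
        sigma = statistics.stdev(mixed)
    return mu, sigma

rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Fully self-consuming loop: the fitted spread shrinks over rounds (collapse).
_, sigma_pure = run_stl(real, real_fraction=0.0, n_rounds=2000, n_samples=20)
# Anchoring each round with a constant 30% of real data keeps the fit stable.
_, sigma_mixed = run_stl(real, real_fraction=0.3, n_rounds=2000, n_samples=20)
print(f"no real data: sigma ~ {sigma_pure:.4f}; 30% real data: sigma ~ {sigma_mixed:.4f}")
```

In this caricature, resampling from a finite fit loses a little variance in expectation each round, so the pure loop drifts toward a degenerate distribution, whereas the constant real-data fraction acts as a fixed point that the iteration cannot drift away from, echoing the qualitative role real data plays in the paper's convergence conditions.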