🤖 AI Summary
How does inter-model recursive training, in which generative AI models serve as mutual sources of training data, affect performance evolution? This study presents the first systematic empirical investigation of cross-modal recursive training dynamics between multilingual language models and image generation models trained on synthetic data. Through multi-stage simulations, theoretical modeling, and large-scale experiments, we propose a data-mediated model-interaction framework. We find that cross-model data injection expands conceptual coverage but simultaneously induces task-performance convergence and output homogenization: performance gains and degradation co-occur, constituting a novel evolutionary mechanism. Our work reveals an intrinsic tension within the "generate–retrain" feedback loop of AI data ecosystems, exposing critical trade-offs in model sustainability, and provides both theoretical foundations and empirical warnings for responsible model evolution and data governance.
📝 Abstract
The internet is replete with AI-generated content and also serves as a common source of training data for generative AI (genAI) models. This duality raises the possibility that future genAI models will be trained on other models' generated outputs. Prior work has studied the consequences of models training on their own outputs, but little work has considered what happens when models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding the downstream effects of such data-mediated model interactions is critical. To this end, we provide empirical evidence for how data-mediated interactions might unfold in practice, develop a theoretical model of this interactive training process, and experimentally demonstrate possible long-term outcomes of such interactions. We find that data-mediated interactions can benefit models by exposing them to novel concepts that may have been missed in the original training data, but can also homogenize their performance on shared tasks.
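The homogenization dynamic described above can be sketched with a deliberately simplified toy model. This is our own illustration, not the paper's actual method: each "model" is reduced to a single 1-D Gaussian mean, and one generate–retrain round is approximated by moment matching on a mixture of the model's own samples and the other model's samples. The function name `retrain_round` and the mixing fraction `alpha` are hypothetical names introduced here for illustration.

```python
# Toy sketch of data-mediated model interaction (illustrative assumption:
# each model is a 1-D Gaussian summarized by its mean, and retraining is
# moment matching on a mixture of its own and the other model's outputs).

def retrain_round(mean_a, mean_b, alpha=0.3):
    """One generate-retrain round.

    alpha is the (hypothetical) fraction of each model's new training
    data drawn from the *other* model's generated outputs.
    """
    new_a = (1 - alpha) * mean_a + alpha * mean_b
    new_b = (1 - alpha) * mean_b + alpha * mean_a
    return new_a, new_b

def simulate(mean_a=0.0, mean_b=10.0, rounds=20, alpha=0.3):
    """Track the gap between the two models across retraining rounds."""
    gaps = [abs(mean_a - mean_b)]
    for _ in range(rounds):
        mean_a, mean_b = retrain_round(mean_a, mean_b, alpha)
        gaps.append(abs(mean_a - mean_b))
    return gaps

gaps = simulate()
```

Under these assumptions the gap shrinks geometrically by a factor of |1 - 2*alpha| per round, so the two models' output distributions converge: a cartoon of the paper's finding that cross-model data injection homogenizes outputs even while it transfers each model's concepts to the other.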