🤖 AI Summary
This work addresses the phenomenon of "model collapse" in large language models trained recursively on synthetic data, which manifests as variance contraction in the embedding space and degradation of representations. The authors propose SIGMA, the first scalable spectral analysis framework that leverages the spectral properties of Gram matrices, integrating random matrix theory with efficient spectral bound estimation to circumvent the need for full eigendecomposition in large-scale models. SIGMA effectively quantifies and predicts the onset of collapse, elucidates its underlying mathematical mechanisms, and provides a deployable health-monitoring metric for recursive training regimes. By bridging theoretical rigor with practical scalability, this approach represents a significant advance in understanding and mitigating representational collapse in self-improving language models.
📄 Abstract
The rapid adoption of synthetic data for training Large Language Models (LLMs) has introduced the technical challenge of "model collapse": a degenerative process in which recursive training on model-generated content leads to a contraction of distributional variance and representational quality. While the phenomenology of collapse is increasingly well documented, rigorous methods to quantify and predict its onset in high-dimensional spaces remain elusive. In this paper, we introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework that benchmarks model collapse through the spectral lens of the embedding Gram matrix. By deriving and utilizing deterministic and stochastic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. Crucially, our stochastic formulation enables scalable estimation of these bounds, making the framework applicable to large-scale foundation models where full eigendecomposition is intractable. We demonstrate that SIGMA effectively captures the transition towards degenerate states, offering both theoretical insights into the mechanics of collapse and a practical, scalable tool for monitoring the health of recursive training pipelines.
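To make the core idea concrete, here is a minimal sketch of a Gram-matrix collapse indicator in the spirit the abstract describes. This is not SIGMA's actual bound derivation (the paper's specific inequalities are not reproduced here); it only illustrates how a spectral summary of the embedding Gram matrix can track variance contraction without a full eigendecomposition, using power iteration for the top eigenvalue. The function names `top_eigenvalue` and `collapse_indicator` are illustrative, not from the paper.

```python
import numpy as np

def top_eigenvalue(G, iters=200, seed=0):
    """Estimate lambda_max of a PSD Gram matrix via power iteration
    (avoids full eigendecomposition)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = G @ v
        v = w / np.linalg.norm(w)
    return float(v @ G @ v)

def collapse_indicator(X, iters=200):
    """X: (n, d) array of embeddings.

    Returns trace(G) / lambda_max(G) for the centered Gram (covariance)
    matrix G -- an effective-rank proxy. It is close to d for isotropic,
    healthy embeddings and shrinks toward 1 as variance concentrates in
    a single direction, i.e. as the representation space collapses.
    """
    Xc = X - X.mean(axis=0)
    G = Xc.T @ Xc / X.shape[0]          # (d, d) Gram/covariance matrix
    return float(np.trace(G) / top_eigenvalue(G, iters))
```

Monitoring this ratio across recursive-training generations gives a cheap, scalar health signal: a steady decline toward 1 indicates spectral contraction of the embedding space.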