🤖 AI Summary
This work investigates the learning dynamics and failure boundaries of two-layer autoencoders trained on high-dimensional data with an underlying low-dimensional manifold structure. Motivated by the prevalence of mode collapse, and even full model collapse, in diffusion-based generative modeling on such data, we couple a high-dimensional asymptotic analysis with a dynamical description of online stochastic gradient descent (SGD). We establish, for the first time, a tight asymptotic characterization of the low-dimensional projection of the distribution of generated samples, explicitly quantifying its dependence on the training sample size. The theory uncovers a sequential failure pathway from mode collapse to model collapse, identifying insufficient sample size as the primary trigger, and yields an analytical critical sample-size threshold that marks the onset of model collapse. This framework provides a rigorous theoretical foundation for risk assessment in pipelines that retrain generative models on synthetic data.
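As a concrete sketch of the setup described above, the snippet below trains a tied-weight two-layer autoencoder by online SGD (one fresh sample per step) on high-dimensional data concentrated near a one-dimensional manifold. The architecture, data model, dimensions, and learning-rate scaling are illustrative assumptions, not the paper's exact specification.

```python
# Minimal, hypothetical sketch: tied-weight two-layer autoencoder trained
# by online SGD on data near a 1-d manifold in a high-dimensional space.
# All constants below are illustrative assumptions, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)

d = 500             # ambient dimension (assumed large)
lr = 1.0 / d        # online SGD learning rate (assumed 1/d scaling)
n_samples = 50_000  # one fresh sample per SGD step (online regime)

# Low-dimensional structure: x = s * mu + noise with s = +/-1, so the
# data has two modes along a hidden direction mu (a 1-d manifold).
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)

def sample(n):
    s = rng.choice([-1.0, 1.0], size=n)
    return s[:, None] * mu + 0.3 * rng.standard_normal((n, d))

# Two-layer autoencoder with tied weights: xhat = tanh(w @ x) * w
w = rng.standard_normal(d) / np.sqrt(d)

for x in sample(n_samples):
    h = np.tanh(w @ x)   # encoder (scalar hidden unit)
    xhat = h * w         # tied-weight decoder
    r = xhat - x         # reconstruction residual
    # gradient of 0.5 * ||xhat - x||^2 with respect to w
    grad = r * h + (r @ w) * (1.0 - h**2) * x
    w -= lr * grad

# Overlap of the learned weight with the hidden manifold direction: a
# low-dimensional summary of what the autoencoder has learned.
print("overlap |<w, mu>| / ||w|| =", abs(w @ mu) / np.linalg.norm(w))
```

The printed overlap between the learned weight and the hidden direction is the kind of low-dimensional summary statistic whose dependence on the number of training samples the asymptotic analysis characterizes.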
📝 Abstract
In this manuscript, we consider the problem of learning a flow- or diffusion-based generative model, parametrized by a two-layer autoencoder and trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise and lead to model collapse when the generative model is re-trained on generated synthetic data.
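The retraining pathway mentioned at the end of the abstract can be illustrated with a deliberately simple toy that is not the paper's model: fit a generative model to n samples, draw n synthetic points from the fit, refit, and iterate. Even for a one-dimensional Gaussian, finite-sample fluctuations make the fitted variance contract across generations, and smaller n collapses faster, echoing the sample-size dependence emphasized above.

```python
# Toy illustration (not the paper's setting) of iterated retraining on
# synthetic data: fit a 1-d Gaussian, resample, refit, repeat. The fitted
# standard deviation shrinks across generations, faster for small n.
import numpy as np

rng = np.random.default_rng(0)

def retrain(n, generations=50):
    """Fit a 1-d Gaussian to n points, resample n synthetic points, repeat."""
    data = rng.standard_normal(n)           # generation-0 "real" data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std() # maximum-likelihood fit
        data = rng.normal(mu, sigma, n)     # synthetic data for next round
    return data.std()

# Smaller training sets collapse faster: each refit multiplies the scale
# by a random factor whose mean is below 1 for finite n.
for n in (10, 100, 1000):
    print(f"n={n:5d}  std after 50 generations ~ {retrain(n):.3f}")
```

This toy only shows that finite sample size alone can drive collapse under self-retraining; the manuscript's contribution is a tight asymptotic characterization of when and how this happens for the two-layer autoencoder model above.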