🤖 AI Summary
Although diffusion models are trained in high-dimensional ambient spaces, they effectively learn distributions supported on low-dimensional data manifolds, thereby circumventing the curse of dimensionality. This work gives a theoretical characterization of this intrinsic manifold-learning mechanism and describes it as a “collapse-and-refine” process: at small noise levels, the denoising map rapidly collapses onto the projection onto the data manifold, while at moderate noise levels, training refines the density on that manifold. Building on this insight, the authors propose Score-induced Latent Diffusion (SiLD), a two-stage framework in which manifold learning and density estimation both emerge from a single denoising score matching objective, without the heuristic KL regularization used by VAE-based latent diffusion models. The resulting sample complexity is proven to depend on the intrinsic dimension of the manifold rather than the ambient dimension. Empirical results show that SiLD matches or surpasses VAE-based latent diffusion models in generation quality on Stacked MNIST, CelebA variants, and molecular generation tasks, while consistently improving reconstruction.
📝 Abstract
Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.
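The small-noise "collapse" described above can be illustrated on a toy problem. The sketch below is not the SiLD implementation: it assumes data uniformly distributed on the unit circle in the plane, approximated by a fine grid of points, and additive Gaussian noise x = x0 + sigma * eps. Under those assumptions the ideal denoiser is the posterior mean E[x0 | x], and Tweedie's formula gives the score as (E[x0 | x] - x) / sigma^2. All names (`denoise`, `score`, `n_atoms`) are hypothetical and chosen only for this example.

```python
import numpy as np


def denoise(x, sigma, n_atoms=2000):
    """Posterior mean E[x0 | x] for data uniform on the unit circle.

    The circle is approximated by n_atoms equally spaced points (an
    assumption of this toy example, not part of the paper).
    """
    theta = np.linspace(0.0, 2.0 * np.pi, n_atoms, endpoint=False)
    atoms = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (n_atoms, 2)
    sq_dist = np.sum((atoms - x) ** 2, axis=1)
    log_w = -sq_dist / (2.0 * sigma**2)
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    return w @ atoms  # weighted average of circle points


def score(x, sigma):
    """Score of the noised density via Tweedie's formula."""
    return (denoise(x, sigma) - x) / sigma**2


x = np.array([1.5, 0.5])              # an off-manifold query point
projection = x / np.linalg.norm(x)    # nearest point on the unit circle

sigmas = [1.0, 0.3, 0.05]
gaps = []         # distance between the denoiser output and the projection
score_norms = []  # magnitude of the score at x
for sigma in sigmas:
    gaps.append(np.linalg.norm(denoise(x, sigma) - projection))
    score_norms.append(np.linalg.norm(score(x, sigma)))
    print(f"sigma={sigma}: gap={gaps[-1]:.4f}, |score|={score_norms[-1]:.2f}")
```

In this toy setting the gap to the projection shrinks as sigma decreases while the score magnitude grows, which is the qualitative behavior the abstract attributes to the small-noise regime. The "refine" stage (learning the density on the manifold at moderate noise) is not covered by this sketch.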