Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Diffusion models often suffer from the curse of dimensionality on high-dimensional data, struggling to capture complex multi-manifold and multi-modal structure. This work proposes a mixture of low-rank mixture-of-Gaussians (MoLR-MoG) model, which represents the data as a union of low-dimensional linear subspaces, each endowed with a multi-modal mixture-of-Gaussians latent. A mixture-of-experts (MoE)-structured score function is designed to jointly capture the nonlinear geometry and the multi-modal distributions. MoLR-MoG moves beyond the conventional single-Gaussian latent assumption by combining multiple subspaces with mixture-of-Gaussians priors. On real-world data it achieves significantly better generation quality than Gaussian-latent baselines, matching the performance of an MoE U-Net with only one-tenth of the parameters. Theoretical analysis further establishes estimation error bounds and optimization convergence guarantees that escape the curse of dimensionality.

📝 Abstract
Recently, diffusion models have achieved strong performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality, scaling as $n^{-1/D}$ in the data dimension $D$. Since images usually lie on a union of low-dimensional manifolds, existing works model the data as a union of linear subspaces with Gaussian latents and achieve a $1/\sqrt{n}$ bound. Although this modeling reflects the multi-manifold property, a Gaussian latent cannot capture the multi-modal structure of the latent manifold. To bridge this gap, we propose the mixture of low-rank mixture-of-Gaussians (MoLR-MoG) model, which represents the target data as a union of $K$ linear subspaces, where each subspace admits a mixture-of-Gaussians latent ($n_k$ modes of dimension $d_k$). Under this model, the corresponding score function naturally has a mixture-of-experts (MoE) structure, captures multi-modal information, and is nonlinear. We first conduct real-world experiments showing that the generations of the MoE-latent MoG network are much better than those of the MoE-latent Gaussian score, and that the MoE-latent MoG network achieves performance comparable to an MoE-latent U-Net with $10\times$ as many parameters. These results indicate that the MoLR-MoG model is reasonable and well suited to real-world data. Building on the MoE-latent MoG score, we then prove an estimation error bound of $R^4\sqrt{\sum_{k=1}^K n_k}\sqrt{\sum_{k=1}^K n_k d_k}/\sqrt{n}$, which escapes the curse of dimensionality by exploiting the data structure. Finally, we study the optimization process and prove a convergence guarantee under the MoLR-MoG model. Together, under a setting close to real-world data, these results explain why diffusion models require only a small training sample and enjoy a fast optimization process while achieving strong performance.
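To make the generative assumption concrete, here is a minimal NumPy sketch of sampling from a union of $K$ low-rank subspaces with mixture-of-Gaussians latents, as the abstract describes. All numbers (ambient dimension, the $d_k$, the $n_k$, mode means) are illustrative assumptions, not values from the paper, and unit covariances are used for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative MoLR-MoG setup (hypothetical sizes, not from the paper):
D = 64                 # ambient data dimension
K = 3                  # number of linear subspaces
dims = [4, 6, 8]       # intrinsic dimension d_k of each subspace
modes = [2, 3, 2]      # number of Gaussian modes n_k in each latent mixture

def random_basis(D, d, rng):
    """Orthonormal basis of a random d-dimensional subspace of R^D."""
    Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    return Q

bases = [random_basis(D, d, rng) for d in dims]
# Per-subspace mixture means; identity covariance per mode for simplicity.
means = [rng.standard_normal((n_k, d)) * 3.0 for n_k, d in zip(modes, dims)]

def sample(n):
    """Draw n points from the union-of-subspaces, MoG-latent model."""
    X = np.empty((n, D))
    for i in range(n):
        k = rng.integers(K)                              # pick a subspace
        j = rng.integers(modes[k])                       # pick a mode in it
        z = means[k][j] + rng.standard_normal(dims[k])   # MoG latent draw
        X[i] = bases[k] @ z                              # embed into R^D
    return X

X = sample(500)
print(X.shape)  # (500, 64): every sample lies exactly on one subspace
```

Each sample sits exactly on one of the $K$ subspaces, so projecting a point onto its generating basis reconstructs it with zero residual; the multi-modal latent is what a single-Gaussian subspace model cannot express.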
Problem

Research questions and friction points this paper is trying to address.

diffusion models
curse of dimensionality
multi-modal
multi-subspace
estimation error
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
mixture of experts
multi-subspace modeling
curse of dimensionality
low-rank mixture of Gaussians