🤖 AI Summary
Problem: Existing flow matching models assume data follow simple source distributions (e.g., a standard Gaussian) and fail to capture the intrinsic low-dimensional manifold structure of high-dimensional real-world data, leading to low learning efficiency, poor multimodal modeling capability, and high training costs.
Method: This work introduces the first integration of pre-trained deep latent variable models (VAEs or VAE-GANs) into the continuous flow matching framework, enabling manifold-aware flow learning. Our approach jointly leverages latent-space modeling, continuous normalizing flows, and manifold alignment optimization, supporting physics-constrained generation and latent-space conditional control.
Results: Experiments demonstrate substantial improvements: image generation quality increases while training cost drops by ~50%; generated samples in the Darcy flow task better satisfy underlying physical laws; and multimodal structure modeling capability and generalization are significantly enhanced.
📄 Abstract
Flow matching models have shown great potential among probabilistic generative models for image generation tasks. Building on the ideas of continuous normalizing flows, flow matching models generalize the transport paths of diffusion models from a simple prior distribution to the data distribution. Most flow matching models in the literature do not explicitly model the underlying structure/manifold in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside on a low-dimensional manifold. Existing strategies for incorporating manifolds, including data with an underlying multi-modal distribution, often require expensive training and hence frequently lead to suboptimal performance. To this end, we present Latent-CFM, which provides simplified training/inference strategies to incorporate multi-modal data structures using pretrained deep latent variable models. Through experiments on multi-modal synthetic data and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training (~50% less in some cases) and computation than state-of-the-art flow matching models. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competitive approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation based on latent features.
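To make the core idea concrete, here is a minimal numpy sketch of a conditional flow matching objective augmented with a latent feature from a frozen, pretrained encoder. This is an illustration under stated assumptions, not the authors' implementation: the names `vae_encode`, `velocity`, and `latent_cfm_loss` are hypothetical, the networks are stand-in linear maps, and the straight-line interpolant with target velocity `x1 - x0` is the standard CFM construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained, frozen VAE encoder mapping data x1 to a
# latent feature z (hypothetical; a real model would be a deep network).
W_enc = rng.standard_normal((4, 2)) * 0.1
def vae_encode(x1):
    return x1 @ W_enc  # (batch, latent_dim)

# Stand-in velocity network v(x_t, t, z): a random linear map over the
# concatenated inputs, in place of a learned model.
W_v = rng.standard_normal((4 + 1 + 2, 4)) * 0.1
def velocity(x_t, t, z):
    inp = np.concatenate([x_t, t[:, None], z], axis=1)
    return inp @ W_v

def latent_cfm_loss(x0, x1):
    """Flow matching loss on a straight-line path, conditioned on a latent z."""
    t = rng.uniform(size=x0.shape[0])                  # random times in [0, 1]
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1    # linear interpolant
    u_target = x1 - x0                                 # target velocity of the path
    z = vae_encode(x1)                                 # frozen latent feature of the data
    v_pred = velocity(x_t, t, z)
    return np.mean(np.sum((v_pred - u_target) ** 2, axis=1))

x0 = rng.standard_normal((8, 4))   # samples from the simple source (Gaussian)
x1 = rng.standard_normal((8, 4))   # samples standing in for the data distribution
loss = latent_cfm_loss(x0, x1)
print(float(loss))
```

Conditioning the velocity field on `z` is what lets the flow specialize to the mode of the data manifold the latent feature identifies; in training, only the velocity network's parameters would be optimized while the encoder stays frozen.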