Distribution Matching Variational AutoEncoder

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual generative models (e.g., VAEs) implicitly constrain the latent space without explicit prior modeling, obscuring the relationship between latent structure and generation performance. To address this, we propose DMVAE—a novel variational autoencoder incorporating a differentiable distribution matching constraint that explicitly aligns the encoder’s posterior with an arbitrary reference distribution (e.g., self-supervised learning feature distributions or diffusion noise distributions), thereby overcoming the limitations of conventional Gaussian priors. Our method enables flexible prior selection and reveals that SSL-derived feature distributions achieve an optimal trade-off between reconstruction fidelity and modeling efficiency. On ImageNet, DMVAE achieves a gFID of 3.2 using only 64 training epochs—significantly narrowing the gap between tractable latent representations and high-fidelity image synthesis.

📝 Abstract
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modeling. Yet existing approaches, such as VAEs and foundation-model-aligned encoders, implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other priors. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID = 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
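The abstract does not spell out which differentiable distribution matching constraint DMVAE uses; one standard choice for aligning a batch of encoder latents with samples from a reference distribution is a kernel Maximum Mean Discrepancy (MMD) penalty. The sketch below is illustrative, not the paper's method: the names (`mmd2`, `z_good`, `z_bad`), the RBF kernel, and the bandwidth are all assumptions, and a standard normal stands in for the reference (e.g. SSL feature) distribution.

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    # Gram matrix of the Gaussian (RBF) kernel between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(z, ref, sigma=4.0):
    # Unbiased estimate of squared Maximum Mean Discrepancy between
    # latent samples z and reference samples ref; near 0 when the two
    # batches come from the same distribution.
    n, m = len(z), len(ref)
    kzz = rbf_kernel(z, z, sigma)
    krr = rbf_kernel(ref, ref, sigma)
    kzr = rbf_kernel(z, ref, sigma)
    # Drop diagonal terms from the within-sample averages (unbiased form).
    return ((kzz.sum() - np.trace(kzz)) / (n * (n - 1))
            + (krr.sum() - np.trace(krr)) / (m * (m - 1))
            - 2.0 * kzr.mean())

rng = np.random.default_rng(0)
dim = 8
ref = rng.standard_normal((256, dim))          # stand-in reference samples (e.g. SSL features)
z_good = rng.standard_normal((256, dim))       # latents whose distribution already matches
z_bad = rng.standard_normal((256, dim)) + 3.0  # latents shifted away from the reference

print(mmd2(z_good, ref), mmd2(z_bad, ref))
```

In a DMVAE-style training loop this penalty would be added to the reconstruction loss in place of the usual KL term to a fixed Gaussian prior; because the estimate is differentiable in `z`, gradients flow back into the encoder.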
Problem

Research questions and friction points this paper is trying to address.

Aligns encoder latent distribution with arbitrary reference distribution
Generalizes beyond Gaussian prior to self-supervised or diffusion distributions
Investigates optimal latent distributions for reconstruction fidelity and modeling efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicitly aligns encoder latent with reference distribution
Generalizes beyond Gaussian prior to arbitrary distributions
Uses SSL-derived distributions for fidelity-efficiency balance
Sen Ye, Peking University
Jianning Pei, UCAS
Mengde Xu, Huazhong University of Science and Technology (computer vision)
Shuyang Gu, Microsoft Research Asia (computer vision, generative model)
Chunyu Wang, Tencent
Liwei Wang, Peking University
Han Hu, Tencent