🤖 AI Summary
This work addresses the sensitivity of latent diffusion models to sampling perturbations, which often arises from variance collapse in the latent space and leads to degraded generation quality. For the first time, this study explicitly identifies the critical role of sampling perturbation robustness in generative performance and proposes a variance-expansion loss to learn perturbation-robust latent representations while preserving high reconstruction fidelity. The method achieves an adaptive balance through an adversarial trade-off between reconstruction accuracy and latent variance, optimized within a β-VAE encoder framework coupled with diffusion-based sampling. Extensive experiments demonstrate consistent improvements in generation quality across diverse latent diffusion architectures, confirming that enhancing robustness in the latent space effectively stabilizes and elevates image synthesis performance.
📝 Abstract
Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the $\beta$-VAE-based tokenizers commonly used in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. Specifically, we introduce a Variance Expansion loss that counteracts variance collapse, and we leverage the adversarial interplay between reconstruction and variance expansion to reach an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
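The abstract does not give the exact form of the Variance Expansion loss, but the idea it describes, a term that penalizes latent variance collapsing below a target while a reconstruction term pulls in the opposite direction, can be sketched as follows. This is a minimal NumPy illustration under assumed names (`variance_expansion_loss`, `min_var`, the hinge form, and the weight `gamma` are all hypothetical choices, not the paper's definition):

```python
import numpy as np

def variance_expansion_loss(z, min_var=1.0):
    """Hypothetical variance-expansion term: hinge penalty on any
    latent dimension whose batch variance falls below `min_var`,
    counteracting variance collapse in the latent space."""
    per_dim_var = z.var(axis=0)                       # variance of each latent dim
    return np.maximum(min_var - per_dim_var, 0.0).mean()

def total_loss(x, x_hat, z, gamma=1.0, min_var=1.0):
    """Illustrative combined objective: reconstruction fidelity (MSE)
    plus the variance-expansion penalty; `gamma` trades the two off."""
    recon = ((x - x_hat) ** 2).mean()
    return recon + gamma * variance_expansion_loss(z, min_var)

rng = np.random.default_rng(0)
z_collapsed = rng.normal(0.0, 0.1, size=(256, 8))     # near-collapsed latents
z_spread = rng.normal(0.0, 2.0, size=(256, 8))        # well-spread latents

print(variance_expansion_loss(z_collapsed) > 0.0)     # collapsed latents are penalized
print(variance_expansion_loss(z_spread) == 0.0)       # spread latents incur no penalty
```

The adversarial interplay the abstract describes arises because lowering the reconstruction term tends to shrink latent variance, while the expansion term pushes variance back up; training settles at a balance between the two.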