🤖 AI Summary
This work addresses the limitations of existing latent diffusion models, whose encoder designs often rely on heuristics and struggle to achieve semantic discriminability, reconstruction fidelity, and latent-space compactness simultaneously. To this end, the authors propose the Geometric Autoencoder (GAE), which systematically unifies semantic alignment, reconstruction robustness, and compression efficiency for the first time. GAE leverages a vision foundation model to provide low-dimensional semantic supervision, incorporates latent normalization and dynamic noise sampling, and dispenses with the KL divergence constraint of conventional VAEs to construct a more stable latent manifold. Evaluated on ImageNet-1K at 256×256 resolution, GAE achieves a generative FID (gFID) of 1.82 after only 80 training epochs, improving to 1.31 at 800 epochs, significantly outperforming current methods while striking an exceptional balance among compression, semantic expressiveness, and reconstruction stability.
📝 Abstract
Latent diffusion models have established a new state of the art in high-resolution visual generation. Integrating Vision Foundation Model (VFM) priors improves generative efficiency, yet existing latent designs remain largely heuristic and often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose the Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to guide the autoencoder. Furthermore, we introduce latent normalization, which replaces the restrictive KL divergence of standard VAEs and yields a more stable latent manifold optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE strikes a superior balance among compression, semantic expressiveness, and reconstruction stability. These results validate our design choices and offer a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.
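The abstract names two latent-space ideas without implementation detail: normalizing latents instead of imposing a KL penalty, and perturbing them with noise of randomly sampled intensity so reconstruction stays robust. As a rough, non-authoritative sketch of what those two operations could look like (the function names, shapes, and noise schedule here are our own illustrative assumptions, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_latents(z, eps=1e-6):
    """Channel-wise standardization of latents (zero mean, unit variance).

    Illustrative stand-in for the latent normalization that the paper
    uses in place of a VAE's KL-divergence constraint.
    """
    mean = z.mean(axis=(0, 2, 3), keepdims=True)
    std = z.std(axis=(0, 2, 3), keepdims=True)
    return (z - mean) / (std + eps)

def dynamic_noise(z, max_sigma=1.0):
    """Add Gaussian noise with a per-sample intensity drawn at random.

    Illustrative guess at 'dynamic noise sampling': each batch element
    gets its own noise level, so the decoder sees a range of corruption
    strengths during training.
    """
    sigma = rng.uniform(0.0, max_sigma, size=(z.shape[0], 1, 1, 1))
    return z + sigma * rng.standard_normal(z.shape)

# Toy batch of encoder latents: (batch, channels, height, width),
# deliberately mis-scaled to show the effect of normalization.
z = rng.standard_normal((4, 8, 16, 16)) * 3.0 + 1.5
z_norm = normalize_latents(z)
z_noisy = dynamic_noise(z_norm)
print(z_norm.mean(), z_norm.std())  # close to 0 and 1 respectively
```

In a real training loop, `z_noisy` would be fed to the decoder and the reconstruction loss computed against the clean input, encouraging robustness under high-intensity noise as the abstract describes.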