Boosting Latent Diffusion Models via Disentangled Representation Alignment

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing approaches to aligning variational autoencoders (VAEs) with vision foundation models (VFMs) fail to account for the distinct representational requirements of VAEs and latent diffusion models (LDMs), thereby limiting generative performance. This work proposes the Semantic Disentangled VAE (Send-VAE), which employs a non-linear mapper network to align the VAE latent space with the semantic hierarchy of a pretrained VFM. This alignment lets the VAE focus on attribute-level disentangled representations while allowing the LDM to leverage high-level semantics for generation. Send-VAE is the first method to explicitly differentiate the representation objectives of VAEs and LDMs, establishing an optimization framework tailored for disentangled learning and revealing a strong correlation between degree of disentanglement and generation quality. On ImageNet 256×256, Send-VAE achieves state-of-the-art performance, with an FID of 1.21 with classifier-free guidance and 1.75 without, while significantly accelerating training.

📝 Abstract
Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel at semantic disentanglement, encoding attribute-level information in a structured way. To address this, we propose the Semantic Disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning by aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFM features to bridge the gap between attribute-level disentanglement and high-level semantics and to provide effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing a strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show Send-VAE significantly speeds up training and achieves state-of-the-art FIDs of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256×256.
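The core mechanism in the abstract can be illustrated with a minimal sketch: a small non-linear mapper transforms VAE latent tokens into the VFM feature space, and an alignment loss (here a cosine-similarity objective, a common choice for representation alignment; the paper's exact loss and architecture are not given on this page, so all names and shapes below are illustrative assumptions) measures how well mapped latents match frozen VFM features.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_mapper(z, w1, w2):
    # Hypothetical non-linear mapper: a one-hidden-layer MLP that projects
    # VAE latent tokens toward the (higher-dimensional) VFM feature space.
    h = np.maximum(z @ w1, 0.0)  # ReLU hidden layer
    return h @ w2

def alignment_loss(mapped, vfm_feats, eps=1e-8):
    # 1 - mean cosine similarity between mapped latents and VFM features.
    # 0 means perfect alignment; 2 means perfectly anti-aligned.
    a = mapped / (np.linalg.norm(mapped, axis=-1, keepdims=True) + eps)
    b = vfm_feats / (np.linalg.norm(vfm_feats, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

# Toy shapes: batch of 4 images, 16 latent tokens of dim 8,
# frozen VFM patch features of dim 32 (all stand-in values).
z = rng.normal(size=(4, 16, 8))    # VAE latent tokens
f = rng.normal(size=(4, 16, 32))   # frozen VFM patch features
w1 = rng.normal(size=(8, 64)) * 0.1
w2 = rng.normal(size=(64, 32)) * 0.1

loss = alignment_loss(mlp_mapper(z, w1, w2), f)
print(f"alignment loss: {loss:.4f}")
```

In training, this alignment term would be added to the usual VAE reconstruction/KL objective and minimized over both the VAE encoder and the mapper, while the VFM stays frozen; the mapper's non-linearity is what lets the VAE latents stay attribute-level while still being steered by high-level VFM semantics.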
Problem

Research questions and friction points this paper is trying to address.

Latent Diffusion Models
Variational Autoencoders
Semantic Disentanglement
Representation Alignment
Vision Foundation Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation
Latent Diffusion Models
Vision Foundation Models
Send-VAE
representation alignment