Exploring Representation-Aligned Latent Space for Better Generation

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing latent diffusion models suffer from weak semantic representation in VAE latent spaces, leading to distorted image details and degraded perceptual quality. To address this, we propose ReaLS—a semantics-aligned latent space that jointly models explicit high-level semantic priors (e.g., semantic segmentation, depth) with latent representations, enhancing both interpretability and semantic fidelity without sacrificing compression efficiency. Methodologically, we extend the VAE architecture with a semantics-aligned loss and integrate DiT/SiT backbones to enable end-to-end diffusion model training in ReaLS. Evaluated on standard benchmarks, our approach achieves a 15% reduction in FID and significantly improves performance on perception-oriented downstream tasks—including semantic segmentation and depth estimation—thereby overcoming the longstanding limitation of conventional VAEs, which prioritize pixel-level reconstruction over semantic consistency.
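The summary above describes a composite training objective: the usual VAE reconstruction and KL terms plus a semantics-aligned loss that pulls latents toward external high-level features. The paper's exact formulation and loss weights are not given on this page, so the NumPy sketch below is purely illustrative — `reals_loss`, `cosine_alignment_loss`, `beta`, and `lam` are hypothetical names, and a cosine-similarity term stands in for whatever alignment loss the method actually uses.

```python
import numpy as np

def cosine_alignment_loss(z_proj, sem_feat):
    """Negative mean cosine similarity between projected latents and
    frozen semantic features (hypothetical stand-in for the paper's
    semantics-aligned loss)."""
    z = z_proj / np.linalg.norm(z_proj, axis=-1, keepdims=True)
    s = sem_feat / np.linalg.norm(sem_feat, axis=-1, keepdims=True)
    return -np.mean(np.sum(z * s, axis=-1))

def reals_loss(x, x_hat, mu, logvar, z_proj, sem_feat, beta=1.0, lam=0.5):
    """Illustrative combined objective:
    reconstruction + beta * KL + lam * semantic alignment.
    beta and lam are assumed hyperparameters, not values from the paper."""
    recon = np.mean((x - x_hat) ** 2)                     # pixel-level MSE
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # Gaussian KL
    align = cosine_alignment_loss(z_proj, sem_feat)       # semantic prior term
    return recon + beta * kl + lam * align
```

In this shape, the alignment term is the only change relative to a standard VAE objective, which matches the summary's claim that compression efficiency is retained while semantic fidelity improves.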

📝 Abstract
Generative models serve as powerful tools for modeling the real world, with mainstream diffusion models, particularly those based on the latent diffusion model paradigm, achieving remarkable progress across various tasks, such as image and video synthesis. Latent diffusion models are typically trained using Variational Autoencoders (VAEs), interacting with VAE latents rather than the real samples. While this generative paradigm speeds up training and inference, the quality of the generated outputs is limited by the latents' quality. Traditional VAE latents are often seen as spatial compression in pixel space and lack explicit semantic representations, which are essential for modeling the real world. In this paper, we introduce ReaLS (Representation-Aligned Latent Space), which integrates semantic priors to improve generation performance. Extensive experiments show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. Furthermore, the enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
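The abstract's point that latent diffusion models "interact with VAE latents rather than the real samples" can be made concrete with a minimal forward-diffusion sketch: pixels pass through the encoder once, and all noising happens in latent space. The `encode` function and noise schedule below are hypothetical placeholders, not the paper's components.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # stand-in for a frozen VAE encoder: a fixed 4x spatial reduction
    return x.reshape(x.shape[0], -1, 4).mean(axis=-1)

def add_noise(z0, t, alphas_cumprod):
    """Standard DDPM-style forward process applied to latents:
    z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps, eps

# the denoiser (e.g., a DiT/SiT backbone) would be trained to predict eps
# from (z_t, t); pixels never enter the diffusion loss, which is why
# output quality is bounded by the quality of the latent space
x = rng.standard_normal((2, 16))              # toy "images"
z0 = encode(x)                                # latents, 4x smaller
alphas_cumprod = np.linspace(0.999, 0.01, 1000)  # assumed schedule
z_t, eps = add_noise(z0, t=500, alphas_cumprod=alphas_cumprod)
```

This is exactly where ReaLS intervenes: it changes what `encode` produces, not the diffusion process itself.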
Problem

Research questions and friction points this paper is trying to address:

- Image Quality
- Semantic Information
- Generative Models
Innovation

Methods, ideas, or system contributions that make the work stand out:

- ReaLS
- Image Semantics Enhancement
- FID Improvement