DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

📅 2025-08-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Increasing the number of latent channels in autoencoders improves reconstruction fidelity but severely impedes diffusion model convergence, limiting both generative performance and the practical deployment of highly compressive autoencoders. To address this, we propose a structured latent space design: the leading latent channels explicitly model global structure, while trailing channels preserve fine-grained local details. We further introduce a structural-channel-aware diffusion training objective that accelerates convergence by prioritizing structural consistency. Integrating this framework with a deep-compression autoencoder (DC-AE-1.5-f64c128) and a multi-objective optimization strategy, our method achieves superior generative quality on ImageNet 512×512 compared to DC-AE-f32c32, while reducing training time by 4×. This work effectively breaks the long-standing trade-off between quality and efficiency in latent diffusion models, enabling high-fidelity synthesis under extreme spatial compression.
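The structural-channel-aware objective described above can be sketched as a standard denoising loss plus an extra loss term restricted to the leading "structure" channels. This is a minimal illustration under assumed details, not the authors' implementation: the split point `n_struct` and the weight `struct_weight` are hypothetical.

```python
import numpy as np

def augmented_diffusion_loss(pred_noise, true_noise, n_struct=32, struct_weight=1.0):
    """Plain diffusion MSE plus an extra MSE on the leading 'structure'
    channels (channels-first layout: batch, channel, H, W).
    n_struct and struct_weight are illustrative guesses, not paper values."""
    # Base objective: MSE over all latent channels.
    base = np.mean((pred_noise - true_noise) ** 2)
    # Extra objective: MSE on the leading structural channels only,
    # pushing the model to get global structure right early in training.
    struct = np.mean((pred_noise[:, :n_struct] - true_noise[:, :n_struct]) ** 2)
    return base + struct_weight * struct
```

With `struct_weight > 0`, gradients on the structural channels are effectively upweighted relative to the detail channels, which is one simple way to realize "prioritizing structural consistency."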

πŸ“ Abstract
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.
Problem

Research questions and friction points this paper is trying to address.

Slow convergence in diffusion models with high-channel autoencoders
Trade-off between reconstruction quality and generation quality
Limitation in employing autoencoders with higher compression ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Latent Space for channel-wise organization
Augmented Diffusion Training with extra objectives
Higher compression ratios with faster convergence
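For scale, the latent shapes implied by the autoencoder names (fN = N× spatial downsampling, cM = M latent channels) can be worked out with simple arithmetic. The helper below is an illustration of that naming convention, not code from the paper:

```python
def latent_shape(image_hw, f, c):
    """Latent spatial size and element count for an fN/cM autoencoder:
    the image is downsampled f-fold in each spatial dimension and
    encoded into c channels."""
    h, w = image_hw
    return (h // f, w // f, c), (h // f) * (w // f) * c

# DC-AE-f32c32 on ImageNet 512x512:   16x16x32 latent, 8192 elements.
# DC-AE-1.5-f64c128 on the same input: 8x8x128 latent, also 8192 elements,
# but with 4x fewer spatial positions (64 vs 256) for the diffusion model.
```

Note that the two configurations carry the same total latent budget; the f64 variant trades spatial resolution for channel depth, which is exactly the regime where the structured latent space is needed.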
Junyu Chen
NVIDIA
Dongyun Zou
NVIDIA
Wenkun He
NVIDIA
Junsong Chen
NVIDIA Research Intern, MMLab@HKU
Generative Model, Large Language Model
Enze Xie
NVIDIA Research, MMLab@HKU
Computer Vision, Generative AI
Song Han
NVIDIA
Han Cai
NVIDIA