DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Autoencoders suffer from training instability introduced by GAN objectives and from latent-space redundancy under high compression ratios. Method: This paper proposes the Diffusion-Guided Autoencoder (DGAE), which embeds a conditional diffusion prior into the decoding process: conventional one-shot reconstruction is replaced by progressive denoising, jointly optimizing perceptual loss and reconstruction consistency. DGAE combines a variational autoencoder architecture with a conditional diffusion model, enabling compact latent representations without compromising generation quality. Contribution/Results: DGAE halves latent dimensionality, substantially mitigating over-parameterization while maintaining state-of-the-art (SOTA) performance on both image reconstruction and generation tasks on ImageNet-1K. The compact latent also allows the downstream diffusion model to converge approximately 40% faster than baseline approaches, striking an effective balance among high compression ratio, training stability, and representation compactness.

📝 Abstract
Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GANs remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.
Problem

Research questions and friction points this paper is trying to address.

Reducing training instability caused by GAN in autoencoders
Minimizing latent space dimensionality for efficient representation
Improving decoder expressiveness with diffusion model guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion model to guide decoder
Reduces latent space dimensionality by 2x
Improves performance under high compression
Dongxu Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yuang Peng
Tsinghua University
Generative Model, Multimodal Learning
Haomiao Tang
Tsinghua University, Beijing, China
Yuwei Chen
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Chunrui Han
StepFun, Beijing, China
Zheng Ge
Senior Researcher, StepFun
Multimodal Models, Perception and Reasoning
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning, Foundation Models
Mingxue Liao
Institute of Automation, Chinese Academy of Sciences, Beijing, China