Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion autoencoders (DAEs) suffer from blurred structural reconstruction and loss of fine details due to linear-β noise scheduling, which concentrates most sampling steps in high-noise regimes. To address this, we propose a structure-detail decoupled two-stage training paradigm: Stage I enforces learning of robust, structured latent representations exclusively at the maximum noise level; Stage II refines texture and detail recovery within low-noise regions. This framework is the first to explicitly decouple the structural recovery and detail synthesis constraints inherently entangled in conventional noise scheduling. It introduces latent-space structural regularization and stage-wise supervision to enforce hierarchical fidelity. Experiments demonstrate substantial improvements in reconstruction fidelity—particularly for edges and textures—while preserving semantic editability of latent codes. Our approach establishes a novel paradigm for high-fidelity, controllable image reconstruction.

Technology Category

Application Category

📝 Abstract
Diffusion autoencoders (DAEs) are typically formulated as a noise prediction model and trained with a linear-$eta$ noise schedule that spends much of its sampling steps at high noise levels. Because high noise levels are associated with recovering large-scale image structures and low noise levels with recovering details, this configuration can result in low-quality and blurry images. However, it should be possible to improve details while spending fewer steps recovering structures because the latent code should already contain structural information. Based on this insight, we propose a new DAE training method that improves the quality of reconstructed images. We divide training into two phases. In the first phase, the DAE is trained as a vanilla autoencoder by always setting the noise level to the highest, forcing the encoder and decoder to populate the latent code with structural information. In the second phase, we incorporate a noise schedule that spends more time in the low-noise region, allowing the DAE to learn how to perfect the details. Our method results in images that have accurate high-level structures and low-level details while still preserving useful properties of the latent codes.
Problem

Research questions and friction points this paper is trying to address.

Improving image reconstruction quality in diffusion autoencoders
Reducing blurriness by optimizing noise level scheduling
Enhancing structural and detail recovery in two-phase training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase training for diffusion autoencoders
Highest noise level in initial autoencoder phase
Low-noise focus in second detail refinement phase
🔎 Similar Papers
No similar papers found.
P
Pramook Khungurn
pixiv, Inc.
S
Sukit Seripanitkarn
VISTEC
P
Phonphrm Thawatdamrongkit
VISTEC
Supasorn Suwajanakorn
Supasorn Suwajanakorn
VISTEC
VisionDeep LearningGraphics