Improving the Diffusability of Autoencoders

📅 2025-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies abnormally high-frequency components in modern autoencoder latent spaces that disrupt the diffusion model's coarse-to-fine generation process and degrade image and video synthesis quality. The authors first demonstrate the detrimental impact of high-frequency latent noise on diffusion modeling, then propose a scale-equivariance regularization that enforces spectral alignment between the RGB and latent representations, yielding a lightweight, architecture-agnostic improvement in latent-space diffusability. The approach requires at most 20K autoencoder fine-tuning steps and delivers substantial gains: a 19% FID reduction on ImageNet-1K and at least a 44% FVD reduction on Kinetics-700, improving both generative fidelity and training stability without modifying the underlying diffusion or autoencoder architectures.

📝 Abstract
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in autoencoders with large bottleneck channel sizes. We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.
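The scale-equivariance idea in the abstract can be sketched in a few lines: the decoder should commute with downscaling, so decoding a downscaled latent should match downscaling the decoded output. The toy `decode` and `downsample` below are hypothetical 1D stand-ins for illustration, not the authors' implementation.

```python
# Illustrative sketch of a scale-equivariance regularizer (hypothetical, 1D).
# The regularizer penalizes the gap between decode(downsample(z)) and
# downsample(decode(z)); the paper applies this idea to image decoders.

def downsample(x, factor=2):
    """Average-pool a 1D signal by `factor` (a stand-in for image downscaling)."""
    return [sum(x[i:i + factor]) / factor
            for i in range(0, len(x) - factor + 1, factor)]

def decode(z):
    """Toy 'decoder': 2x nearest-neighbour upsampling of the latent."""
    out = []
    for v in z:
        out.extend([v, v])
    return out

def scale_equivariance_loss(z, factor=2):
    """Mean squared error between the two orders of (decode, downsample)."""
    a = decode(downsample(z, factor))   # decode the downscaled latent
    b = downsample(decode(z), factor)   # downscale the decoded signal
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

z = [1.0, 3.0, 2.0, 4.0]
print(scale_equivariance_loss(z))  # 1.0: this toy decoder is not scale-equivariant
```

A perfectly scale-equivariant decoder would drive this loss to zero; fine-tuning the real decoder with such a term is what aligns latent and RGB spectra in the paper's account.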
Problem

Research questions and friction points this paper is trying to address.

High-frequency components in modern autoencoder latent spaces, most pronounced at large bottleneck channel sizes, interfere with the coarse-to-fine nature of diffusion synthesis
The interaction between diffusion backbones and autoencoders has received far less attention than scaling either component alone
These latent-space artifacts hinder image and video generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral analysis revealing excess high-frequency content in autoencoder latents
Scale-equivariance regularization that aligns latent and RGB spaces across frequencies
Lightweight recipe: minimal code changes and at most 20K autoencoder fine-tuning steps, with no architectural modifications
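The spectral analysis the paper performs amounts to comparing how much energy latent signals carry at high frequencies. A minimal 1D illustration using a plain DFT (not the authors' pipeline; the band split and `high_freq_energy_ratio` helper are illustrative choices):

```python
import cmath

def dft_magnitudes(x):
    """Magnitude spectrum of a 1D signal via a direct DFT (O(n^2), illustration only)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def high_freq_energy_ratio(x):
    """Share of non-DC spectral energy in the bins nearest Nyquist
    (a toy proxy for the high-frequency diagnostic described above)."""
    mags = dft_magnitudes(x)
    n = len(mags)
    energies = [m ** 2 for m in mags]
    total = sum(energies[1:])                     # skip the DC bin
    high = sum(energies[n // 4: n - n // 4 + 1])  # upper half of the band
    return high / total

ramp = [float(t) for t in range(8)]   # smooth signal: energy concentrated at low k
checker = [1.0, -1.0] * 4             # fastest oscillation: purely high-frequency
print(high_freq_energy_ratio(ramp))     # well below 1
print(high_freq_energy_ratio(checker))  # 1.0
```

In the paper's setting, the analogous 2D statistic computed on latent channels is what exposes the abnormal high-frequency content, and the scale-equivariance fine-tuning is what suppresses it.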