Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Redundant high-frequency components in high-dimensional VAE latent spaces significantly impede diffusion model convergence and degrade generative quality. This work is the first to identify this spectral mismatch problem and proposes two novel strategies that require no external vision foundation models (VFMs): spectral self-regularization and spectral alignment. These methods jointly optimize denoising capability and reconstruction fidelity by explicitly constraining the frequency-domain distribution of latent representations. The resulting Denoising-VAE is built on a Vision Transformer (ViT) architecture. On ImageNet 256×256, it achieves state-of-the-art reconstruction performance (rFID = 0.28, PSNR = 27.26) and competitive generation quality (gFID = 1.82), while accelerating diffusion training convergence by approximately 2× compared with prior approaches.

📝 Abstract
Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2× faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256×256 benchmark.
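To make the core idea concrete, a spectral self-regularization term can be sketched as a penalty on the high-frequency energy of a latent map. This is an illustrative reconstruction only, not the paper's actual loss: the FFT-based formulation, the radial frequency mask, and the `cutoff_ratio` parameter are all assumptions.

```python
import numpy as np

def spectral_self_regularization(latent, cutoff_ratio=0.5):
    """Illustrative penalty on high-frequency latent energy (a sketch,
    not the paper's exact loss).

    latent: (C, H, W) array of latent activations.
    cutoff_ratio: hypothetical threshold; frequencies beyond this
        fraction of the Nyquist radius are penalized.
    """
    # 2-D FFT per channel; shift the zero frequency to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(latent, axes=(-2, -1)),
                               axes=(-2, -1))
    # Radial frequency grid, normalized so the Nyquist frequency is 1.
    H, W = latent.shape[-2:]
    fy = np.fft.fftshift(np.fft.fftfreq(H))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(W))[None, :]
    radius = np.sqrt(fy**2 + fx**2) / 0.5  # 0.5 cycles/sample = Nyquist
    high_mask = radius > cutoff_ratio
    # Mean squared magnitude of the masked high-frequency coefficients.
    return float(np.mean(np.abs(spectrum[:, high_mask])**2))
```

A smooth latent incurs near-zero penalty, while a noisy one is penalized, which matches the stated goal of suppressing redundant high-frequency noise without dictating low-frequency content.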
Problem

Research questions and friction points this paper is trying to address.

High-dimensional VAE latents hinder diffusion model training convergence
Redundant high-frequency components degrade generative quality in VAEs
Spectral noise in latent space optimization reduces reconstruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral self-regularization suppresses redundant high-frequency noise
Denoising-VAE produces cleaner lower-noise latents without VFMs
Spectral alignment strategy accelerates diffusion model convergence
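The spectral alignment strategy is not detailed on this page; one plausible sketch, under the assumption that it matches the frequency-domain statistics of generated and target latents, is a loss on the difference of their log power spectra. The function name and log-power formulation are illustrative choices, not the paper's method.

```python
import numpy as np

def spectral_alignment_loss(pred_latent, target_latent):
    """Hypothetical sketch: penalize mismatch between the log power
    spectra of predicted and target latents (both shaped (C, H, W))."""
    def log_power(x):
        # Per-channel 2-D FFT; log1p keeps the loss finite at zero power.
        s = np.fft.fft2(x, axes=(-2, -1))
        return np.log1p(np.abs(s)**2)
    return float(np.mean((log_power(pred_latent)
                          - log_power(target_latent))**2))
```

Such a term would give the diffusion model an explicit target spectrum to match, consistent with the claimed faster convergence.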
Xunzhi Xiang
Nanjing University
Xingye Tian
Kling Team, Kuaishou Technology
Guiyu Zhang
The Chinese University of Hong Kong (Shenzhen)
Computer Vision · Pattern Recognition · Machine Learning
Yabo Chen
Shanghai Jiao Tong University
Self-supervised Learning
Shaofeng Zhang
University of Science and Technology of China
Xuebo Wang
Kling Team, Kuaishou Technology
Xin Tao
Kuaishou
Computer Vision · Generative AI
Qi Fan
Nanjing University