Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video VAEs prioritize reconstruction fidelity while neglecting the impact of latent-space spectral structure on diffusion training, resulting in slow convergence and suboptimal generation quality. This work identifies two intrinsic properties of video latent representations: a spatiotemporal spectrum concentrated in low frequencies, and channel-wise energy concentrated in a few dominant modes. Leveraging these insights, we propose two lightweight, general-purpose structured regularization techniques, Local Correlation Regularization and Latent Masked Reconstruction. These methods jointly guide the VAE to learn a spectrally controllable and structurally coherent latent space, significantly improving both training efficiency and generative performance of downstream diffusion models. Experiments demonstrate a 3× speedup in diffusion training, a 10% improvement in video reward scores, and consistent gains over leading open-source video VAEs across multiple quantitative metrics.
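The summary names the two regularizers but gives no formulas. A minimal sketch of what they could look like, assuming Local Correlation Regularization is a neighbor-difference penalty on the latent grid (biasing energy toward low frequencies) and Latent Masked Reconstruction randomly drops latent positions before decoding; all names, shapes, and loss forms here are illustrative, not the paper's actual implementation:

```python
import torch


def local_correlation_loss(z: torch.Tensor) -> torch.Tensor:
    """Penalize differences between temporally/spatially adjacent latent
    vectors, encouraging a low-frequency-biased latent spectrum.
    z: latents of shape (B, C, T, H, W). (Hypothetical form.)"""
    dt = (z[:, :, 1:] - z[:, :, :-1]).pow(2).mean()        # temporal neighbors
    dh = (z[:, :, :, 1:] - z[:, :, :, :-1]).pow(2).mean()  # vertical neighbors
    dw = (z[..., 1:] - z[..., :-1]).pow(2).mean()          # horizontal neighbors
    return dt + dh + dw


def latent_masked_reconstruction_loss(z, decoder, x, mask_ratio=0.5):
    """Zero a random fraction of latent positions (shared across channels)
    and require the decoder to still reconstruct the target x, encouraging
    a redundant, structurally coherent latent space. (Hypothetical form.)"""
    mask = (torch.rand_like(z[:, :1]) > mask_ratio).float()
    x_hat = decoder(z * mask)
    return (x_hat - x).pow(2).mean()
```

Either term would be added to the usual VAE reconstruction/KL objective with a small weight; a perfectly smooth latent (all positions equal) incurs zero local-correlation penalty.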

📝 Abstract
Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
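The two spectral properties the abstract identifies can be measured directly on encoded latents. A rough diagnostic sketch, assuming "low-frequency bias" is the fraction of FFT power in the lowest-frequency bins and "eigenspectrum dominated by a few modes" is the share of variance held by the top channel eigenvalue (the bin cutoffs and both statistics are our assumptions, not the paper's definitions):

```python
import torch


def latent_spectral_stats(z: torch.Tensor):
    """Diagnostics for latents z of shape (B, C, T, H, W).
    Returns (low_freq_energy_ratio, top1_eigenvalue_ratio)."""
    # Spatio-temporal spectrum: share of power in the lowest positive-frequency bins.
    spec = torch.fft.fftn(z.float(), dim=(2, 3, 4)).abs().pow(2)
    T, H, W = z.shape[2:]
    low = spec[:, :, :max(1, T // 4), :max(1, H // 4), :max(1, W // 4)].sum()
    low_ratio = (low / spec.sum()).item()
    # Channel eigenspectrum: eigenvalues of the channel covariance, pooling
    # every spatio-temporal position as a sample.
    flat = z.permute(0, 2, 3, 4, 1).reshape(-1, z.shape[1])
    flat = flat - flat.mean(dim=0, keepdim=True)
    cov = flat.T @ flat / (flat.shape[0] - 1)
    eig = torch.linalg.eigvalsh(cov)  # ascending order
    top1_ratio = (eig[-1] / eig.sum()).item()
    return low_ratio, top1_ratio
```

On a well-structured latent space in the paper's sense, both ratios should be high; comparing them before and after applying the regularizers would show whether the intended spectral bias was induced.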
Problem

Research questions and friction points this paper is trying to address.

Analyzes spectral properties of video VAE latents for diffusion training
Proposes regularizers to bias latents toward low frequencies and few modes
Improves convergence speed and quality in text-to-video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Biasing latent spectrum for easier diffusion training
Introducing lightweight regularizers for spectral properties
Achieving faster convergence and higher video quality