🤖 AI Summary
Existing video VAEs overly prioritize reconstruction fidelity while neglecting the impact of latent-space spectral structure on diffusion training, resulting in slow convergence and suboptimal generation quality. This work identifies two intrinsic properties of video latent representations: a pronounced low-frequency spatiotemporal distribution and channel-wise modality concentration. Leveraging these insights, the authors propose two lightweight, general-purpose structured regularization techniques—local correlation regularization and latent masked reconstruction. These methods jointly guide the VAE toward a spectrally controllable and structurally coherent latent space, significantly improving both the training efficiency and the generative performance of downstream diffusion models. Experiments demonstrate a 3× speedup in diffusion training, a 10% improvement in video reward scores, and consistent gains over leading open-source video VAEs across multiple quantitative metrics.
📝 Abstract
Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
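The abstract names the two regularizers but not their exact form. As a rough intuition only (the function names, shapes, and loss choices below are assumptions, not the paper's actual implementation), a local correlation penalty can be sketched as penalizing differences between adjacent latent positions, which biases the latent spectrum toward low frequencies, and latent masked reconstruction as decoding from a randomly masked latent:

```python
import numpy as np

def local_correlation_penalty(z):
    # z: latent tensor of shape (C, T, H, W).
    # One plausible instantiation (an assumption, not the paper's formula):
    # penalize squared differences between temporally and spatially adjacent
    # latent values, encouraging smooth, low-frequency latent content.
    dt = np.diff(z, axis=1)  # temporal neighbors
    dh = np.diff(z, axis=2)  # vertical neighbors
    dw = np.diff(z, axis=3)  # horizontal neighbors
    return (dt**2).mean() + (dh**2).mean() + (dw**2).mean()

def masked_reconstruction_loss(z, decode, x, mask_ratio=0.5, rng=None):
    # Zero out a random fraction of latent positions (mask shared across
    # channels), then ask the decoder to reconstruct the clean target x.
    # `decode` stands in for the VAE decoder; MSE is an assumed loss.
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(z.shape[1:]) >= mask_ratio).astype(z.dtype)
    x_hat = decode(z * mask)
    return ((x_hat - x) ** 2).mean()
```

For example, a constant latent incurs zero correlation penalty while a noisy one is penalized, matching the intended low-frequency bias; in training, both terms would be added to the usual VAE reconstruction and KL objectives with small weights.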