🤖 AI Summary
Existing video autoencoders struggle to efficiently model dynamic spatiotemporal redundancy, resulting in low compression ratios and high computational costs for downstream tasks. To address this, we propose a hierarchical video autoencoding framework: (1) a dual-stream self-supervised motion encoder explicitly disentangles global (coarse-grained) motion from local (high-frequency spatial) motion, yielding an interpretable and scalable hierarchical latent space; and (2) the first integration of hierarchical latent representations with conditional diffusion-based decoding for high-fidelity reconstruction. Our method achieves state-of-the-art reconstruction quality while attaining a 1428× compression ratio (nearly 30× higher than Cosmos-VAE) and substantially reduces training overhead for downstream generative tasks.
📝 Abstract
Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to significantly reduce redundancy. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428×, almost 30× higher than baseline methods (e.g., Cosmos-VAE at 48×), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
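To make the coarse-to-fine decomposition and the compression-factor arithmetic concrete, here is a minimal NumPy sketch. It is purely illustrative: the pooling/subsampling operators, the 8× downsampling rate, and the tensor shapes are assumptions for the example, not Hi-VAE's actual encoder architecture or configuration.

```python
import numpy as np

def hierarchical_encode(video):
    """Toy coarse-to-fine split of a video tensor (T, H, W, C):
    a tiny 'global motion' code plus a subsampled 'detailed motion' residual.
    (Illustrative stand-in for Hi-VAE's learned dual-stream encoders.)"""
    T, H, W, C = video.shape
    # Global stream: one scalar per frame summarizing coarse dynamics.
    global_motion = video.mean(axis=(1, 2, 3))            # shape (T,)
    # Detailed stream: residual after removing the global component,
    # spatially subsampled 8x per axis (hypothetical rate).
    residual = video - global_motion[:, None, None, None]
    detailed_motion = residual[:, ::8, ::8, :]            # (T, H/8, W/8, C)
    return global_motion, detailed_motion

def compression_factor(video, codes):
    """Ratio of raw pixel count to total latent size, as in the
    paper's 1428x figure (our toy setup yields a much smaller ratio)."""
    return video.size / sum(c.size for c in codes)

video = np.random.rand(16, 256, 256, 3).astype(np.float32)
g, d = hierarchical_encode(video)
cf = compression_factor(video, (g, d))  # ~64x for these toy shapes
```

A learned encoder compresses far more aggressively than this fixed subsampling, and the diffusion decoder reconstructs frames conditioned on both codes rather than inverting them directly; the sketch only shows how the two streams partition the signal and how the compression factor is counted.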