Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video autoencoders struggle to efficiently model dynamic spatiotemporal redundancy, resulting in low compression ratios and high computational costs for downstream tasks. To address this, we propose a hierarchical video autoencoding framework: (1) a dual-stream self-supervised motion encoder explicitly disentangles global (coarse-grained) motion from local (high-frequency spatial) motion, yielding an interpretable and scalable hierarchical latent space; and (2) the first integration of hierarchical latent representations with conditional diffusion-based decoding for high-fidelity reconstruction. Our method achieves state-of-the-art reconstruction quality while attaining a 1428× compression ratio—nearly 30× higher than Cosmos-VAE—and substantially reduces training overhead for downstream generative tasks.

📝 Abstract
Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstruction. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428×, almost 30× higher than baseline methods (e.g., Cosmos-VAE at 48×), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
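To make the headline number concrete, a compression factor is just the ratio of raw video elements to latent elements. The sketch below is purely illustrative arithmetic, not taken from the paper: the clip shape (16 frames of 256×256 RGB) and the derived latent size are hypothetical assumptions chosen so the ratio lands near the reported 1428×.

```python
# Illustrative compression-factor arithmetic. The video shape and latent
# size are hypothetical; only the ratio formula is general.
def compression_factor(video_shape, latent_elems):
    """Ratio of raw video elements to latent elements."""
    t, h, w, c = video_shape
    return (t * h * w * c) / latent_elems

raw = 16 * 256 * 256 * 3          # 3,145,728 raw scalars in the clip
latent = raw // 1428              # ~2,202 latent scalars at 1428x
print(compression_factor((16, 256, 256, 3), latent))
```

For comparison, a 48× baseline (the Cosmos-VAE figure quoted in the abstract) would keep roughly 65,000 latent scalars for the same clip, about 30× more.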
Problem

Research questions and friction points this paper is trying to address.

Efficiently model spatio-temporal redundancies in video dynamics
Reduce excessive training costs for downstream tasks
Achieve high compression while maintaining reconstruction quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical coarse-to-fine motion encoding
Separate global and detailed motion spaces
Conditional diffusion decoder for reconstruction
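The coarse-to-fine split in the bullets above can be sketched mechanically. This is a conceptual stand-in, not Hi-VAE's method: fixed subsampling replaces the paper's learned self-supervised motion encoders, the conditional diffusion decoder is omitted, and all shapes (`T=8`, 64×64 grayscale, 8× downsampling) are hypothetical.

```python
import numpy as np

def split_motion(frames, stride=8):
    """Toy coarse-to-fine motion split for a (T, H, W) grayscale clip.

    Returns (global_motion, detailed_motion): a low-resolution view of the
    dynamics, and the high-frequency residual that view misses.
    """
    mean = frames.mean(axis=0, keepdims=True)   # static appearance
    dynamics = frames - mean                    # per-frame motion signal
    # Global motion: heavily subsampled, coarse-grained dynamics.
    global_motion = dynamics[:, ::stride, ::stride]
    # Detailed motion: residual after upsampling the coarse view back.
    coarse_up = np.repeat(np.repeat(global_motion, stride, axis=1),
                          stride, axis=2)
    detailed_motion = dynamics - coarse_up
    return global_motion, detailed_motion

clip = np.random.rand(8, 64, 64)
g, d = split_motion(clip)
print(g.shape, d.shape)  # (8, 8, 8) (8, 64, 64)
```

The point of the decomposition is that the two streams carry complementary information: summing the upsampled global stream and the detailed stream recovers the full dynamics exactly, which is what lets a decoder condition on both to reconstruct the video.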