🤖 AI Summary
Existing video autoencoders struggle to efficiently model dynamic spatiotemporal redundancy, resulting in low compression ratios and high computational costs for downstream tasks. To address this, we propose a hierarchical video autoencoding framework: (1) a dual-stream self-supervised motion encoder explicitly disentangles global (coarse-grained) motion from local (high-frequency spatial) motion, yielding an interpretable and scalable hierarchical latent space; and (2) the first integration of hierarchical latent representations with conditional diffusion-based decoding for high-fidelity reconstruction. Our method achieves state-of-the-art reconstruction quality while attaining a 1428× compression ratio (nearly 30× higher than Cosmos-VAE) and substantially reduces training overhead for downstream generative tasks.
📝 Abstract
Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to significantly reduce redundancy. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428×, almost 30× higher than baseline methods (e.g., Cosmos-VAE at 48×), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
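To make the coarse-to-fine decomposition and the compression-factor arithmetic concrete, here is a minimal NumPy sketch. It is purely illustrative: the pooling/subsampling operators, the 8× downsampling rate, and the tensor shapes are assumptions for the example, not Hi-VAE's actual encoder architecture or configuration.

```python
import numpy as np

def hierarchical_encode(video):
    """Toy coarse-to-fine split of a video tensor (T, H, W, C):
    a tiny 'global motion' code plus a subsampled 'detailed motion' residual.
    (Illustrative stand-in for Hi-VAE's learned dual-stream encoders.)"""
    T, H, W, C = video.shape
    # Global stream: one scalar per frame summarizing coarse dynamics.
    global_motion = video.mean(axis=(1, 2, 3))            # shape (T,)
    # Detailed stream: residual after removing the global component,
    # spatially subsampled 8x per axis (hypothetical rate).
    residual = video - global_motion[:, None, None, None]
    detailed_motion = residual[:, ::8, ::8, :]            # (T, H/8, W/8, C)
    return global_motion, detailed_motion

def compression_factor(video, codes):
    """Ratio of raw pixel count to total latent size, as in the
    paper's 1428x figure (our toy setup yields a much smaller ratio)."""
    return video.size / sum(c.size for c in codes)

video = np.random.rand(16, 256, 256, 3).astype(np.float32)
g, d = hierarchical_encode(video)
cf = compression_factor(video, (g, d))  # ~64x for these toy shapes
```

A learned encoder compresses far more aggressively than this fixed subsampling, and the diffusion decoder reconstructs frames conditioned on both codes rather than inverting them directly; the sketch only shows how the two streams partition the signal and how the compression factor is counted.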