Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context

πŸ“… 2025-12-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video autoencoders commonly couple spatiotemporal modeling, undermining temporal consistency and limiting compression and reconstruction performance. To address this, we propose ARVAE, a framework that learns spatiotemporally disentangled latent representations. ARVAE explicitly models inter-frame motion via optical flow fields and implicitly captures newly emerged spatial content through residual representations. It further incorporates an inter-frame autoregressive architecture, enabling lossless compression and reconstruction of arbitrarily long videos. A multi-stage joint optimization strategy achieves state-of-the-art (SOTA) reconstruction quality even with lightweight models and limited training data. Extensive experiments demonstrate substantial improvements in both reconstruction fidelity and downstream video generation, validating the efficacy of explicit motion modeling and disentangled spatiotemporal representation.

πŸ“ Abstract
Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose the Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos of arbitrary length. ARVAE introduces a temporal-spatial decoupled representation that combines a downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.
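The per-frame codec the abstract describes (a flow field for motion plus a residual for new content) can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: it assumes an integer flow field with nearest-neighbor warping and takes the flow as given (flow estimation is outside the sketch); the names `warp`, `encode`, and `decode` are illustrative, not ARVAE's API.

```python
import numpy as np

def warp(prev, flow):
    # Backward-warp: each output pixel samples prev at its flow offset.
    # Integer flow + nearest-neighbor sampling keep this toy exact.
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + flow[0], 0, h - 1)
    src_x = np.clip(xs + flow[1], 0, w - 1)
    return prev[src_y, src_x]

def encode(prev, cur, flow):
    # Temporal latent: the flow field. Spatial latent: whatever the
    # warped previous frame cannot explain -- the residual.
    residual = cur - warp(prev, flow)
    return flow, residual

def decode(prev, flow, residual):
    # Reconstruct the current frame from its predecessor plus latents.
    return warp(prev, flow) + residual
```

Note that because the residual absorbs everything the warp misses, reconstruction is exact even when the flow is wrong; a better flow only shrinks the residual, which is what makes the representation compressible without information loss.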
Problem

Research questions and friction points this paper is trying to address.

Existing video autoencoders entangle spatial and temporal information, undermining temporal consistency.
Non-autoregressive designs cannot flexibly process videos of arbitrary length.
Coupled representations force a trade-off between compression efficiency and information loss.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive compression of frames using predecessor context
Decoupled temporal-spatial representation with flow fields
Multi-stage training for lightweight model optimization
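The autoregressive structure above is what permits arbitrary-length videos: only a keyframe and a chain of per-frame latents are stored, and decoding conditions each frame on the previous reconstruction. A minimal sketch of that loop, with flow estimation stubbed out (zero flow, so the residual carries the full inter-frame change); `encode_video` and `decode_video` are hypothetical names, and the loop structure, not the codec, is the point:

```python
import numpy as np

def encode_video(frames):
    # Keyframe plus one latent per subsequent frame, each conditioned
    # on its predecessor. With zero flow, the latent is a plain residual.
    latents = []
    prev = frames[0]
    for cur in frames[1:]:
        latents.append(cur - prev)   # stand-in for (flow, residual)
        prev = cur
    return frames[0], latents

def decode_video(keyframe, latents):
    # Autoregressive reconstruction: each frame conditions on the
    # previous reconstruction, so any number of latents works.
    frames = [keyframe]
    for res in latents:
        frames.append(frames[-1] + res)
    return frames
```

Because the decoder only ever needs the most recent reconstruction, memory is constant in video length, in contrast to architectures that ingest a fixed-size clip at once.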
πŸ”Ž Similar Papers
2024-07-10arXiv.orgCitations: 3
Authors

Cuifeng Shen (Zoom Communications)
Lumin Xu (The Chinese University of Hong Kong)
Xingguo Zhu (Zoom Communications)
Gengdai Liu (Zoom Communications)

Topics: Computer Vision, Multimodal Learning, Deep Learning