Recurrent Video Masked Autoencoders

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the inefficiency of spatiotemporal modeling, the high computational cost, and the difficulty of jointly capturing video-level and pixel-level semantics in video representation learning, this paper proposes Recurrent Video Masked-Autoencoders (RVM), a lightweight video masked-autoencoding framework built on a recurrent Transformer. Its core contribution is a recurrent video masked-autoencoding paradigm that temporally aggregates dense frame-wise features and performs asymmetric masked reconstruction, enabling joint spatiotemporal modeling. RVM propagates features over long sequences with linear computational complexity, is up to 30× more parameter-efficient than competing video masked autoencoders, and trains end-to-end without knowledge distillation. On action recognition and object tracking, RVM matches VideoMAE and V-JEPA; on geometric and dense spatial understanding tasks, it performs favorably against DINOv2, particularly at compact model sizes. Qualitative visualizations confirm that RVM learns unified representations of scene semantics, structure, and motion.

📝 Abstract
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
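The training loop described in the abstract can be caricatured in a few lines: per-frame features are folded into a recurrent state, and masked frames are reconstructed in pixel space from that state alone. The sketch below is a minimal numpy toy, not the paper's architecture; every name (`encode_frame`, `update_state`, `decode_pixels`, `mask_token`) and every dimension is a hypothetical stand-in for the actual ViT encoder, recurrent Transformer, and decoder.

```python
import numpy as np

# Toy sketch of recurrent masked autoencoding with random linear maps.
# All components are illustrative stand-ins, not the paper's actual modules.
rng = np.random.default_rng(0)
D, P = 16, 8 * 8                            # feature dim, flattened pixel dim
W_enc = rng.normal(size=(P, D)) * 0.1       # stand-in for the frame encoder
W_rec = rng.normal(size=(2 * D, D)) * 0.1   # stand-in for the recurrent transformer
W_dec = rng.normal(size=(D, P)) * 0.1       # lightweight pixel decoder
mask_token = rng.normal(size=D) * 0.1       # fed to the recurrence for masked frames

def encode_frame(x):
    # Dense image features for one frame.
    return np.tanh(x @ W_enc)

def update_state(state, feat):
    # Recurrent temporal aggregation: fold the new frame into the state.
    return np.tanh(np.concatenate([state, feat]) @ W_rec)

def decode_pixels(state):
    # Asymmetric prediction: masked pixels come from the recurrent state only.
    return state @ W_dec

T = 5
frames = rng.normal(size=(T, P))
mask = np.array([False, False, True, False, True])  # frames to reconstruct

state = np.zeros(D)
loss = 0.0
for t in range(T):
    if mask[t]:
        # Masked frame: predict its pixels from the state, score with plain MSE.
        pred = decode_pixels(state)
        loss += np.mean((pred - frames[t]) ** 2)
        feat = mask_token                   # the encoder never sees masked frames
    else:
        feat = encode_frame(frames[t])
    state = update_state(state, feat)       # one state update per frame: O(T)

print(round(float(loss), 4))
```

Note that the loop touches each frame exactly once, which is what makes the cost linear in sequence length; the reconstruction objective is the same pixel-MSE used by standard masked autoencoders.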
Problem

Research questions and friction points this paper is trying to address.

How to learn spatiotemporal video representations efficiently, given the cost of standard spatio-temporal attention
How to perform well on both video-level and dense image-level tasks without knowledge distillation
How to propagate features stably over long temporal horizons at linear computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent transformer aggregates dense image features over time
Asymmetric masked prediction with pixel reconstruction objective
Linear computational cost for stable long-term feature propagation
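The linear-cost claim in the last bullet can be made concrete with a toy operation count: a recurrence does one state update per frame, while full spatio-temporal attention scores every token pair across all frames. The functions and constants below are made up for illustration and are not measurements from the paper.

```python
# Illustrative operation counts: recurrent propagation vs. full
# spatio-temporal attention. Constants are hypothetical.

def recurrent_cost(T, n_tokens):
    # One state update per frame: linear in sequence length T.
    return T * n_tokens

def full_attention_cost(T, n_tokens):
    # Attention over all T * n_tokens tokens: quadratic in T.
    return (T * n_tokens) ** 2

for T in (16, 64, 256):
    r = recurrent_cost(T, 196)        # 196 tokens ~ a 14x14 patch grid
    a = full_attention_cost(T, 196)
    print(T, r, a, round(a / r))      # the gap widens linearly with T
```

Doubling the clip length doubles the recurrent cost but quadruples the attention cost, which is why long-horizon propagation favors the recurrent design.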
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30