🤖 AI Summary
Existing masked video modeling (MVM) approaches rely heavily on hand-crafted masking heuristics and lack explicit motion awareness. To address this, we propose the Trajectory-Aware Adaptive Token Sampler (TATS), a framework that jointly optimizes a masking policy and a masked autoencoder (MAE) end-to-end via Proximal Policy Optimization (PPO). TATS dynamically identifies motion-centric tokens—those exhibiting high spatiotemporal trajectory variance—and prioritizes them for masking, enabling motion-driven, high-ratio masking (≥85%) while preserving reconstruction fidelity and accelerating pretraining. Evaluated on four major action recognition benchmarks—Something-Something v2, Kinetics-400, UCF101, and HMDB51—TATS achieves state-of-the-art performance. It reduces memory overhead by 17% compared to standard MAE, exhibits superior cross-dataset generalization, and enhances downstream transferability across diverse video understanding tasks.
📝 Abstract
Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage motion priors such as optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that pre-training remains memory efficient. Extensive experiments on four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our approach compared to other state-of-the-art methods.
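To make the notion of "motion-centric" token selection concrete, here is a minimal sketch of the underlying idea: score each spatial token by the variance of its embedding trajectory across frames and mask the highest-variance fraction at an aggressive ratio. Note this is an illustrative fixed heuristic, not the paper's method — TATS learns its sampling policy jointly with the MAE via PPO; all names and shapes below are assumptions.

```python
import numpy as np

def motion_centric_mask(tokens, mask_ratio=0.85):
    """Illustrative heuristic (not the learned TATS policy): rank spatial
    tokens by the variance of their embedding trajectory over time and
    mask the highest-variance (most motion-centric) fraction.

    tokens: array of shape (T, N, D) -- T frames, N tokens per frame,
            D-dimensional token embeddings.
    Returns a boolean mask of shape (N,), True = masked.
    """
    T, N, D = tokens.shape
    # Variance of each token's trajectory across time, averaged over
    # feature dimensions, yields one motion score per token.
    scores = tokens.var(axis=0).mean(axis=-1)        # shape (N,)
    n_mask = int(round(mask_ratio * N))
    order = np.argsort(scores)[::-1]                 # highest variance first
    mask = np.zeros(N, dtype=bool)
    mask[order[:n_mask]] = True
    return mask

# Toy example: 8 frames, 10 tokens, 4-dim features; only tokens 0-2 "move".
rng = np.random.default_rng(0)
x = np.tile(rng.normal(size=(1, 10, 4)), (8, 1, 1))  # static scene
x[:, :3, :] += rng.normal(scale=2.0, size=(8, 3, 4)) # inject motion
m = motion_centric_mask(x, mask_ratio=0.3)           # masks the moving tokens
```

In the paper's framework, this hand-crafted variance score is replaced by a learned sampler whose token-selection actions are rewarded through the MAE's reconstruction objective and optimized with PPO.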