Reinforcement Learning Meets Masked Video Modeling: Trajectory-Guided Adaptive Token Selection

📅 2025-05-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing masked video modeling (MVM) approaches rely heavily on hand-crafted masking heuristics and lack explicit motion awareness. To address this, we propose the Trajectory-Aware Adaptive Token Sampler (TATS), the first framework that jointly optimizes a masking policy and a masked autoencoder (MAE) end-to-end via Proximal Policy Optimization (PPO). TATS dynamically identifies motion-centric tokens, i.e., those exhibiting high spatiotemporal trajectory variance, and prioritizes them for masking, enabling motion-driven, high-ratio masking (≥85%) while preserving reconstruction fidelity and accelerating pretraining. Evaluated on four major action recognition benchmarks (Something-Something v2, Kinetics-400, UCF101, and HMDB51), TATS achieves state-of-the-art performance. It reduces memory overhead by 17% compared to standard MAE, exhibits superior cross-dataset generalization, and enhances downstream transferability across diverse video understanding tasks.
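As a rough illustration of the motion-centric selection described above, the sketch below ranks tokens by the variance of their positions across frames and masks the highest-variance ones first. The input layout, the function name, and the fixed variance rule are all illustrative assumptions; in the paper the masking policy is learned jointly with the MAE via PPO rather than hard-coded.

```python
import numpy as np

def select_masked_tokens(token_trajectories, mask_ratio=0.85):
    """Rank tokens by spatiotemporal trajectory variance and mask the
    highest-variance (most motion-centric) ones first.

    token_trajectories: array of shape (num_tokens, num_frames, 2)
    holding each token's (x, y) position over time (an assumed layout).
    Returns (masked_indices, visible_indices).
    """
    # Variance of each token's position over frames, summed over x and y.
    variance = token_trajectories.var(axis=1).sum(axis=-1)
    num_masked = int(round(mask_ratio * len(variance)))
    # Sort tokens from highest to lowest trajectory variance.
    order = np.argsort(variance)[::-1]
    masked_idx = order[:num_masked]
    visible_idx = order[num_masked:]
    return masked_idx, visible_idx

# Toy example: 6 static tokens plus 2 jittering tokens over 4 frames.
rng = np.random.default_rng(0)
static = np.tile(rng.uniform(size=(6, 1, 2)), (1, 4, 1))  # zero motion
moving = rng.uniform(size=(2, 4, 2))                      # high motion
trajs = np.concatenate([static, moving], axis=0)
masked, visible = select_masked_tokens(trajs, mask_ratio=0.75)
```

With a 0.75 ratio over 8 tokens, 6 are masked, and the two moving tokens (indices 6 and 7) are always among them since the static tokens have zero variance.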

📝 Abstract
Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage motion priors such as optical flow, or semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition, while also ensuring that the pre-training remains memory efficient. Extensive experiments across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our approach compared to other state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal masking strategy for video modeling
Joint training of MAE and adaptive token sampler
Improving action recognition with aggressive masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-Aware Adaptive Token Sampler (TATS)
Joint optimization of MAE and TATS using PPO
Aggressive masking without performance compromise
Ayush K. Rai
Insight Research Ireland Centre for Data Analytics, Dublin City University
Kyle Min
Intel Labs
Tarun Krishna
Insight Research Ireland Centre for Data Analytics, Dublin City University
Feiyan Hu
Insight Research Ireland Centre for Data Analytics, Dublin City University
Alan F. Smeaton
Insight Research Ireland Centre for Data Analytics, Dublin City University
Noel E. O'Connor
CEO, Insight Centre for Data Analytics, Dublin City University