SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video self-supervised learning methods (e.g., VideoMAE) rely on reconstructing pixels of natural videos, which suffer from temporal redundancy, weak semantic representation, and insufficient motion modeling. To address these limitations, the authors propose SMILE, a self-supervised framework that jointly leverages high-level spatial semantics (guided by image-language models such as CLIP) and controllable synthetic motion patterns. By integrating masked video modeling with cross-modal alignment, SMILE captures both structural semantics and dynamic motion priors; it further establishes a new paradigm that learns strong video representations without requiring any natural video data. Evaluated on seven datasets covering downstream scenarios such as action recognition, temporal localization, and video retrieval, SMILE consistently outperforms state-of-the-art SSL methods, yielding more discriminative and generalizable representations. The code is publicly available.

📝 Abstract
Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details of natural videos, which contain substantial temporal redundancy, limiting their capacity for semantic representation and for sufficiently encoding motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, that infuses both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns into the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We carried out extensive experiments on seven datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available: https://github.com/fmthoker/SMILE
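The "synthetic motion patterns" idea can be illustrated with a minimal sketch: paste an object crop onto a static background and translate it frame by frame, producing a clip whose motion is fully controllable. This is an assumption-laden toy (the function name, linear motion, and wrap-around are illustrative choices, not the paper's exact generation procedure):

```python
import numpy as np

def synthetic_motion_clip(background, obj, num_frames=8,
                          start=(0, 0), velocity=(4, 6)):
    """Illustrative sketch: paste `obj` onto copies of `background`,
    translating it linearly each frame to create controllable motion.

    background: (H, W, 3) uint8 static image
    obj:        (h, w, 3) uint8 object crop
    Returns a (num_frames, H, W, 3) uint8 clip.
    """
    H, W, _ = background.shape
    h, w, _ = obj.shape
    frames = []
    for t in range(num_frames):
        # Linear trajectory with wrap-around so the object stays in frame.
        y = int(start[0] + velocity[0] * t) % (H - h)
        x = int(start[1] + velocity[1] * t) % (W - w)
        frame = background.copy()
        frame[y:y + h, x:x + w] = obj  # composite object at current position
        frames.append(frame)
    return np.stack(frames)
```

Clips built this way carry motion cues by construction, which is what lets a masked-video model train without any natural video.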
Problem

Research questions and friction points this paper is trying to address.

Enhance video SSL with spatial and motion semantics
Overcome pixel-level redundancy in natural video data
Learn discriminative video representations without natural videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infuses spatial and motion semantics in learning
Uses image-language models to guide spatial semantics
Introduces synthetic motion patterns for dynamic content
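The CLIP-guidance bullet above can be sketched as a simple cross-modal alignment objective: pool the video encoder's visible-patch features and pull them toward a frozen image-language embedding. Both the mean-pooling and the 1 − cosine form are illustrative assumptions, not necessarily the paper's exact loss:

```python
import numpy as np

def spatial_alignment_loss(student_tokens, target_embedding):
    """Illustrative sketch: align pooled video-encoder features with a
    frozen image-language embedding (e.g., from CLIP).

    student_tokens:   (N, D) features of visible patches
    target_embedding: (D,) frozen embedding of the corresponding frame
    Returns 1 - cosine similarity (0 when directions match).
    """
    pooled = student_tokens.mean(axis=0)  # aggregate patch features
    cos = pooled @ target_embedding / (
        np.linalg.norm(pooled) * np.linalg.norm(target_embedding) + 1e-8)
    return 1.0 - cos
```

Minimizing this term injects high-level spatial semantics into the masked-modeling objective, complementing the pixel-level reconstruction loss.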