VideoSSR: Video Self-Supervised Reinforcement Learning

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity and high cost of high-quality annotated data in video understanding, this paper proposes a fully self-supervised paradigm for generating verifiable training data and using it for reinforcement learning. Methodologically, it introduces three verifiable pretext tasks (Anomaly Grounding, Object Counting, and Temporal Jigsaw) that exploit the intrinsic spatiotemporal structure of videos, coupled with a rule-based verifiable reward mechanism for reinforcement learning. Based on this framework, the authors establish VIUBench, a challenging benchmark for evaluating video intrinsic understanding, and VideoSSR-30K, a large-scale self-supervised video dataset. Extensive experiments across 17 video understanding benchmarks, covering general video QA, long video QA, temporal grounding, and complex reasoning, show that VideoSSR yields an average improvement of over 5%. The approach strengthens the video comprehension capabilities of multimodal large language models while improving the scalability of training, since no manual annotation is required.
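This page does not reproduce the paper's data-generation pipeline. As a rough, hypothetical illustration of how one pretext task (Temporal Jigsaw) could turn an unlabeled video into a verifiable training sample, the sketch below splits a video into segments, shuffles them, and records the true order as the answer; all function and field names are assumptions, not the authors' implementation.

```python
import random

def make_temporal_jigsaw_sample(frame_paths, num_segments=4, seed=0):
    """Hypothetical sketch (not the authors' code): build one self-supervised
    Temporal Jigsaw sample from an unlabeled video given as a list of frame paths.

    The segments are shown to the model in shuffled order; because the true
    chronological order is known by construction, the answer is verifiable
    without any human annotation.
    """
    rng = random.Random(seed)

    # Split the frames into contiguous, equal-length segments (trailing frames
    # that do not fill a segment are dropped for simplicity).
    seg_len = len(frame_paths) // num_segments
    segments = [frame_paths[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments)]

    # Shuffle the segments; order[j] is the original index of the j-th shown segment.
    order = list(range(num_segments))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]

    # Verifiable ground truth: for each chronological position, which shown
    # segment belongs there.
    answer = [order.index(i) for i in range(num_segments)]

    question = (f"The video is split into {num_segments} shuffled segments "
                f"(numbered 0..{num_segments - 1} in the order shown). "
                "List the segment numbers in their true chronological order.")
    return {"segments": shuffled, "question": question, "answer": answer}
```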

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.
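The abstract states that the rewards are verifiable but does not spell out the reward rules. Below is a minimal sketch of what such rule-based checks could look like for the three pretext tasks (exact match for Object Counting, full-permutation match for Temporal Jigsaw, temporal IoU for Anomaly Grounding); the function names and the 0.5 IoU threshold are assumptions rather than the paper's specification.

```python
def counting_reward(pred: int, gold: int) -> float:
    """Object Counting: exact-match reward (1.0 only if the count is correct)."""
    return 1.0 if pred == gold else 0.0

def jigsaw_reward(pred_order: list, gold_order: list) -> float:
    """Temporal Jigsaw: reward only a fully recovered permutation."""
    return 1.0 if pred_order == gold_order else 0.0

def grounding_reward(pred_span, gold_span, iou_thresh: float = 0.5) -> float:
    """Anomaly Grounding: temporal IoU between predicted and true (start, end)
    intervals in seconds, binarized at an assumed threshold."""
    (ps, pe), (gs, ge) = pred_span, gold_span
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    iou = inter / union if union > 0 else 0.0
    return 1.0 if iou >= iou_thresh else 0.0
```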
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of complex annotated video datasets for MLLMs
Investigating self-generation of verifiable training data from video content
Developing self-supervised reinforcement learning for video understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three verifiable pretext tasks (Anomaly Grounding, Object Counting, Temporal Jigsaw) self-generate training data from unlabeled videos
VideoSSR framework and the VideoSSR-30K dataset enhance multimodal large language models through RLVR (a minimal sketch of how verifiable rewards could drive the update follows this list)
Improves video understanding by over 5% on average across 17 benchmarks spanning four domains
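As referenced above, here is a minimal sketch, under the assumption of a GRPO-style RLVR setup, of how the verifiable rewards for a group of sampled answers could be turned into the advantages that drive the policy update; the paper may use a different normalization.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical sketch: convert the verifiable rewards of several sampled
    answers to one self-generated question into group-relative advantages,
    the normalization used by GRPO-style RLVR. Whether VideoSSR uses exactly
    this scheme is not stated on this page."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one Object Counting question, scored by a
# rule-based reward (1.0 = exact count, 0.0 otherwise).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> answers 1 and 4 get positive advantage, answers 2 and 3 negative
```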
👥 Authors
Zefeng He (Shanghai Artificial Intelligence Laboratory; Nanjing University)
Xiaoye Qu (Shanghai AI Lab)
Yafu Li (The Chinese University of Hong Kong)
Siyuan Huang (Shanghai Jiao Tong University)
Daizong Liu (Wuhan University)
Yu Cheng (The Chinese University of Hong Kong)