Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video reasoning methods lack an autonomous reasoning mechanism analogous to the "chain-of-thought" (CoT) paradigm in image understanding. This paper introduces Video-Thinker, the first end-to-end video reasoning framework that extends the image CoT paradigm of multimodal large language models (MLLMs) to video without requiring external tools. Video-Thinker uses reinforcement learning to let the model autonomously invoke its built-in spatiotemporal grounding and captioning capabilities during inference, generating temporally coherent reasoning traces. Training proceeds in two stages: supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO). Furthermore, we construct Video-Thinker-10K, the first video reasoning dataset explicitly designed for autonomous tool invocation, containing 10K high-quality samples. On multiple video reasoning benchmarks, our 7B-parameter model substantially outperforms strong baselines such as Video-R1 and achieves state-of-the-art performance among models of comparable scale.

📝 Abstract
Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Extends dynamic reasoning paradigm to video tasks
Enables autonomous grounding and captioning for video reasoning
Eliminates need for external tools in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning to enhance video reasoning
Autonomously leverages grounding and captioning capabilities
Employs supervised fine-tuning followed by Group Relative Policy Optimization (GRPO)
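The GRPO stage mentioned above scores a group of sampled responses per prompt and normalizes each reward against the group's statistics, avoiding a learned value network. A minimal sketch of that group-relative advantage computation, assuming a simple list of scalar rewards for one group (function name and `eps` are illustrative, not from the paper):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one group of sampled responses.

    Each response's advantage is its reward standardized by the
    group's mean and standard deviation; eps guards against a
    zero-variance group (all rewards identical).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled reasoning traces, two judged correct (1.0)
# and two incorrect (0.0); correct traces get positive advantage.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The advantages sum to (approximately) zero within a group, so the policy update pushes probability mass from below-average traces toward above-average ones relative to the same prompt.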
Shijian Wang
Southeast University, Monash University, Xiaohongshu Inc.
Jiarui Jin
Xiaohongshu; Shanghai Jiao Tong University; University College London
Multimodal Mining, Recommender System, Information Retrieval, Large Language Model
Xingjian Wang
Monash University
Linxin Song
University of Southern California
Runhao Fu
Monash University
Hecheng Wang
Fudan University
Zongyuan Ge
Monash University
Yuan Lu
I-squared-R
Blockchains, Distributed Computing, Decentralization
Xuelian Cheng
Monash University
3D Vision, Medical Imaging, Machine Learning