Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address key bottlenecks in long-video understanding with multimodal large language models (MLLMs), namely weak cross-modal interaction, severe hallucination, and imbalanced multi-task difficulty, this paper proposes VITAL. First, it introduces a tool-augmented multimodal reasoning mechanism: visual tools densely sample salient frames on demand, and the model generates multimodal chain-of-thought (CoT) reasoning over them, strengthening vision–language alignment. Second, it constructs two large-scale, multi-task video reasoning datasets (MTVR-CoT-72k and MTVR-RL-110k) to support supervised fine-tuning and reinforcement learning. Third, it proposes Difficulty-aware Group Relative Policy Optimization (DGRPO), a novel RL algorithm that jointly optimizes question answering and temporal grounding via difficulty-aware reward shaping. Evaluated on 11 diverse video understanding benchmarks, VITAL achieves state-of-the-art performance across the board, with particularly significant gains on long-video QA and fine-grained temporal localization.
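The exact reward-shaping rule in DGRPO is not given in this summary. As a rough, hedged illustration only: GRPO normalizes each reward within its sampled rollout group, and a difficulty-aware variant could scale those group-relative advantages by a per-group difficulty estimate. All function names and the specific weighting scheme below are hypothetical, inferred from the summary, not the paper's actual algorithm.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero std for uniform groups
    return [(r - mu) / sigma for r in rewards]

def difficulty_weight(rewards):
    """Hypothetical difficulty estimate: groups where few rollouts succeed
    are treated as harder and up-weighted. Assumes rewards in [0, 1]."""
    success_rate = mean(rewards)
    return 1.0 + (1.0 - success_rate)  # harder group -> weight closer to 2.0

def dgrpo_advantages(rewards):
    """Sketch of difficulty-aware shaping: scale the group-relative
    advantages by the group's estimated difficulty."""
    w = difficulty_weight(rewards)
    return [w * a for a in group_relative_advantages(rewards)]
```

Under this toy scheme, a question that most rollouts get wrong contributes larger-magnitude advantages than an easy one, which is one plausible way to counter difficulty imbalance across tasks.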

📝 Abstract
The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data, and model weights will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video reasoning in MLLMs for long videos
Reducing cross-modal interaction limitations and hallucinations
Improving multi-task learning in video understanding benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented learning for dense video frame sampling
Multimodal chain-of-thought for precise reasoning
Difficulty-aware policy optimization for multi-task learning
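The "dense video frame sampling" tool can be pictured as a function the model calls with a time window it wants to inspect more closely, returning extra frame timestamps under a fixed frame budget. The sketch below is a hypothetical stand-in for such a tool interface, not the paper's actual toolbox API.

```python
def dense_sample_timestamps(start_s, end_s, fps=2.0, max_frames=32):
    """Hypothetical visual tool: return densely spaced frame timestamps
    (in seconds) inside a window the model asks to inspect, capped at a
    frame budget so long windows do not blow up the context."""
    duration = end_s - start_s
    n = min(max_frames, max(1, int(duration * fps)))
    step = duration / n
    # Sample at the midpoint of each sub-interval for even coverage.
    return [start_s + step * (i + 0.5) for i in range(n)]
```

In an agentic loop, the model would emit a tool call like `dense_sample_timestamps(42.0, 57.0)` when its initial sparse frames miss the moment in question, then continue its multimodal CoT over the newly decoded frames.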
Haoji Zhang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Xin Gu
University of Chinese Academy of Sciences
Jiawen Li
Bytedance Intelligent Creation
Chixiang Ma
Bytedance Intelligent Creation
Sule Bai
Tsinghua Shenzhen International Graduate School, Tsinghua University
Chubin Zhang
Tsinghua University
Bowen Zhang
Bytedance Intelligent Creation
Zhichao Zhou
ShanghaiTech University
Dongliang He
ByteDance Inc.
Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University