ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video agents struggle with fine-grained compositional reasoning and with mapping high-level intents to executable actions, because their tool spaces are coarse and their action spaces are flat. To address this, the work proposes ReTool-Video, which rests on two components: the MetaAug-Video Tool Library (MVTL), an extensible library of 134 registered tools (26 base tools and 108 meta tools), and a recursive tool invocation mechanism. At runtime, the mechanism executes matched actions directly and grounds unmatched high-level video intents into multimodal tool chains through parameter repair, tool substitution, or decomposition. MVTL provides dual-level access to both structured video information and raw modal evidence. On MVBench, MLVU, and Video-MME (without subtitles), the approach consistently outperforms strong baselines, and further analysis indicates that recursive grounding combined with fine-grained meta tools improves the stability and effectiveness of complex video understanding.
📝 Abstract
Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
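The grounding loop described in the abstract (run matched actions directly, hand unmatched intents to a resolver) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tool names (`detect_events`, `filter_by_label`, `count_items`), the `SUBSTITUTES` and `RECIPES` tables, and the `ground` and `resolve` functions are all hypothetical, and the resolver here is a pair of lookup tables standing in for whatever the paper actually uses. Parameter repair is omitted for brevity; only tool substitution and decomposition are shown.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is (intent name, keyword arguments). All names below are hypothetical.
Step = Tuple[str, Dict[str, Any]]

# Toy registry: one stand-in "base" tool that reads raw evidence,
# and two stand-in "meta" tools that operate on intermediate results.
TOOLS: Dict[str, Callable[..., Any]] = {
    "detect_events": lambda video: list(video["events"]),
    "filter_by_label": lambda items, label: [e for e in items if e["label"] == label],
    "count_items": lambda items: len(items),
}

# Assumed resolver knowledge: aliases for substitution, recipes for decomposition.
SUBSTITUTES: Dict[str, str] = {"find_events": "detect_events"}
RECIPES: Dict[str, Callable[[Dict[str, Any]], List[Step]]] = {
    "count_repeated_event": lambda args: [
        ("find_events", {}),
        ("filter_by_label", {"label": args["label"]}),
        ("count_items", {}),
    ],
}


def resolve(name: str, args: Dict[str, Any]) -> List[Step]:
    """Turn an unmatched intent into one or more finer-grained steps."""
    if name in SUBSTITUTES:                 # tool substitution
        return [(SUBSTITUTES[name], args)]
    if name in RECIPES:                     # decomposition into a chain
        return RECIPES[name](args)
    raise LookupError(f"cannot ground intent: {name}")


def ground(name: str, args: Dict[str, Any], value: Any,
           trace: List[str], depth: int = 0, max_depth: int = 4) -> Any:
    """Recursively ground an intent, threading `value` through the chain."""
    if depth > max_depth:
        raise RecursionError(f"grounding depth exceeded at: {name}")
    if name in TOOLS:                       # matched action: execute directly
        trace.append(name)
        return TOOLS[name](value, **args)
    for sub_name, sub_args in resolve(name, args):   # unmatched: delegate
        value = ground(sub_name, sub_args, value, trace, depth + 1, max_depth)
    return value


# Usage: a high-level "repeated-event aggregation" intent on toy data.
video = {"events": [
    {"label": "jump", "t": 1.0},
    {"label": "run", "t": 2.5},
    {"label": "jump", "t": 4.0},
]}
trace: List[str] = []
result = ground("count_repeated_event", {"label": "jump"}, video, trace)
```

In this sketch the abstract intent `count_repeated_event` is never registered as a tool; it is decomposed into three steps, one of which (`find_events`) is itself substituted before execution, so `trace` ends up recording only concrete tool calls. The depth limit is an added safeguard against cyclic resolutions and is not something the abstract specifies.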
Problem

Research questions and friction points this paper is trying to address.

video understanding
tool-augmented agents
compositional reasoning
action grounding
meta tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

recursive tool use
meta-augmented tool grounding
video reasoning
tool library
multimodal video understanding
🔎 Similar Papers
Xiao Liu
Chongqing University
Nayu Liu
Tianjin University
Junnan Zhu
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing
Ruirui Chen
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
Guohui Xiang
Chongqing National Data AI Research Institute, AI Research Lab
Changjian Wang
Chongqing National Data AI Research Institute, AI Research Lab
Kaiwen Wei
Chongqing University
Rongzhen Li
Chongqing National Data AI Research Institute, AI Research Lab
Jiang Zhong
Chongqing University