ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video agents struggle with fine-grained compositional reasoning and with mapping high-level intents to executable actions, because their tool spaces are coarse and their action spaces are flat. To address this, the work proposes ReTool-Video, a recursive tool-using method, together with the MetaAug-Video Tool Library (MVTL), an extensible library of 134 registered tools (26 base tools and 108 meta tools). Actions that match a registered tool are executed directly; unmatched intents are passed to a resolver that repairs parameters, substitutes tools, or decomposes the intent, so that high-level video intents are progressively grounded into multimodal tool chains at runtime. MVTL provides dual-level access to both structured video information and raw modal evidence. On MVBench, MLVU, and Video-MME (without subtitles), the approach consistently outperforms strong baselines, and further analysis indicates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

📝 Abstract

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
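The grounding loop described in the abstract (execute matched actions directly; otherwise repair parameters, substitute a tool, or decompose) can be illustrated with a small toy sketch. This is not the paper's implementation: the tool names (`detect_events`, `merge_segments`, `count_items`), the rule tables (`ALIASES`, `SUBSTITUTES`, `DECOMPOSITIONS`), the canned segment data, and the `ground` function are all hypothetical stand-ins, and a simple rule lookup is assumed in place of whatever resolver the authors actually use.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Intent:
    """A requested action: a name plus keyword arguments."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)

TOOLS: Dict[str, Callable[..., Any]] = {}  # hypothetical tool registry

def tool(fn):
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def detect_events(label: str) -> List[tuple]:
    # Stand-in for a base tool: returns canned (start, end) segments in seconds.
    canned = {"jump": [(0.0, 1.0), (1.2, 2.0), (5.0, 6.0)]}
    return canned.get(label, [])

@tool
def merge_segments(data, gap: float = 0.5):
    # Stand-in for a meta tool: merge segments separated by at most `gap` seconds.
    merged = []
    for start, end in sorted(data):
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

@tool
def count_items(data) -> int:
    # Stand-in for a meta tool: count intermediate results.
    return len(data)

# Assumed resolver rules (illustrative only).
ALIASES = {"query": "label", "segments": "data"}   # parameter repair
SUBSTITUTES = {"temporal_merge": "merge_segments"}  # tool substitution

def _count_repeated(args):
    # Decomposition of an abstract intent into a chain of lower-level intents.
    return [
        Intent("detect_events", {"query": args["label"]}),  # needs repair
        Intent("temporal_merge", {"gap": 0.5}),             # needs substitution
        Intent("count_items"),
    ]

DECOMPOSITIONS = {"count_repeated_events": _count_repeated}

def ground(intent: Intent, upstream=None, depth: int = 0, max_depth: int = 5):
    """Recursively ground an intent into executed tool calls."""
    if depth > max_depth:
        raise RecursionError(f"could not ground {intent.name!r}")
    fn = TOOLS.get(intent.name)
    if fn is not None:
        kwargs = dict(intent.args)
        if upstream is not None:
            kwargs.setdefault("data", upstream)  # feed the previous step's output
        try:
            inspect.signature(fn).bind(**kwargs)  # do the arguments fit the tool?
        except TypeError:
            repaired = {ALIASES.get(k, k): v for k, v in intent.args.items()}
            if repaired == intent.args:
                raise  # nothing left to repair
            return ground(Intent(intent.name, repaired), upstream, depth + 1)
        return fn(**kwargs)  # matched action: execute directly
    if intent.name in SUBSTITUTES:
        return ground(Intent(SUBSTITUTES[intent.name], intent.args), upstream, depth + 1)
    if intent.name in DECOMPOSITIONS:
        result = upstream
        for step in DECOMPOSITIONS[intent.name](intent.args):
            result = ground(step, result, depth + 1)  # chain the sub-intents
        return result
    raise LookupError(f"no tool or resolver rule for {intent.name!r}")

print(ground(Intent("count_repeated_events", {"label": "jump"})))  # prints 2
```

In this sketch, `count_repeated_events` is not a registered tool, so it is decomposed; its first step has a misnamed parameter that is repaired, and its second step names an unregistered tool that is substituted. The depth limit is an assumed safeguard against resolver rules that never reach a registered tool.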

Problem

Research questions and friction points this paper is trying to address.

video understanding

tool-augmented agents

compositional reasoning

action grounding

meta tools

Innovation

Methods, ideas, or system contributions that make the work stand out.

recursive tool use

meta-augmented tool grounding

video reasoning