VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate the capability of multimodal large language models (MLLMs) to perform long-horizon, multi-step tool invocation in complex visual tasks. To address this gap, this work introduces a novel evaluation benchmark grounded in real-world computer vision workflows, integrating 32 OpenCV tools and 680 hierarchically structured tasks. It presents the first fine-grained, multi-level evaluation framework that enables systematic assessment of both multi-tool composition and generalization to unseen operations. Comprehensive experiments across 19 state-of-the-art models reveal significant bottlenecks in tool adaptability and compositional planning, with even the best-performing model, Gemini-3.0-Pro, achieving only 51% accuracy—highlighting the current limitations of vision agents in complex tool-based reasoning.

📝 Abstract
Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model, Gemini-3.0-Pro, achieving only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
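
For intuition, the sketch below shows the kind of multi-step OpenCV tool composition the benchmark evaluates. It is a minimal illustration under assumptions: the tool registry, tool names, and plan format are hypothetical and are not VTC-Bench's actual toolset or API.

```python
# Illustrative sketch only: a toy "visual tool chain" in the spirit of VTC-Bench.
# The registry, tool names, and plan format are hypothetical, not the benchmark's API.
import cv2
import numpy as np

# Hypothetical tool registry: each tool maps an image to a transformed image.
TOOLS = {
    "to_gray": lambda img: cv2.cvtColor(img, cv2.COLOR_BGR2GRAY),
    "blur":    lambda img: cv2.GaussianBlur(img, (5, 5), 0),
    "canny":   lambda img: cv2.Canny(img, 50, 150),
    "dilate":  lambda img: cv2.dilate(img, np.ones((3, 3), np.uint8)),
}

def run_chain(image, plan):
    """Apply a sequence of named tools, mirroring a multi-step execution trajectory."""
    for step in plan:
        image = TOOLS[step](image)
    return image

if __name__ == "__main__":
    img = cv2.imread("input.png")  # any test image
    result = run_chain(img, ["to_gray", "blur", "canny", "dilate"])
    # Count contours in the final edge mask (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(result, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    print(f"Edge regions found: {len(contours)}")
```

In a benchmark task, the model itself must select and order such operations given only the input image and the goal; that planning step is precisely where the paper reports current models falling short.
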
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
tool composition
visual reasoning
benchmark evaluation
agentic capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool chaining
multimodal agentic models
visual reasoning benchmark
compositional tool use
OpenCV-based operations