🤖 AI Summary
Existing benchmarks lack systematic evaluation of multimodal tool orchestration under the Model Context Protocol, especially for complex multi-hop, multi-threaded workflows involving visual grounding, cross-tool dependencies, and persistence of intermediate state across steps.
Method: We introduce M^3-Bench, the first benchmark dedicated to multimodal tool use under the Model Context Protocol, comprising 231 real-world tools deployed across 28 servers and enabling end-to-end multimodal workflow evaluation. It features a novel similarity-driven trajectory alignment method: tool-call signatures are embedded with a sentence encoder, and similarity-based bucketing combined with the Hungarian algorithm yields auditable, one-to-one tool-call matching that decouples semantic fidelity from procedural consistency. Execution traces are validated by an integrated executor and a four-LLM adjudication panel.
Contribution/Results: Experiments reveal significant bottlenecks in current multimodal LLMs in argument fidelity and structural consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs.
📝 Abstract
We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary ensemble of four large language model (LLM) judges reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structural consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench.
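To make the alignment step concrete, here is a minimal sketch of similarity-bucketed Hungarian matching as described above. It is an illustrative assumption, not the paper's implementation: the function name, the cosine-similarity bucket threshold, and the use of SciPy's `linear_sum_assignment` are all choices made for this sketch; the embeddings are assumed to come from a sentence encoder applied to serialized tool-call signatures.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_tool_calls(pred_embs, ref_embs, bucket_threshold=0.5):
    """One-to-one alignment of predicted vs. reference tool calls.

    pred_embs, ref_embs: (n, d) and (m, d) arrays of L2-normalized
    tool-signature embeddings (e.g., from a sentence encoder).
    Returns (pred_idx, ref_idx, similarity) triples; pairs whose cosine
    similarity falls below bucket_threshold are dropped, emulating the
    bucketing that filters out implausible matches before assignment.
    """
    sim = pred_embs @ ref_embs.T              # cosine similarity matrix
    cost = 1.0 - sim                          # Hungarian minimizes cost
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return [(int(r), int(c), float(sim[r, c]))
            for r, c in zip(rows, cols)
            if sim[r, c] >= bucket_threshold]
```

Because the assignment is one-to-one and auditable, downstream metrics can score matched pairs for argument fidelity while treating unmatched calls as structural errors, which is how semantic and procedural scores stay decoupled.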