🤖 AI Summary
Existing benchmarks lack systematic evaluation of multimodal tool orchestration under the Model Context Protocol, especially for complex multi-hop, multi-threaded workflows involving visual grounding, cross-tool dependencies, and persistence of intermediate state across steps.
Method: We introduce M^3-Bench, the first benchmark dedicated to multimodal tool use under the Model Context Protocol, comprising 231 real-world tools deployed across 28 servers and enabling end-to-end multimodal workflow evaluation. It features a novel similarity-driven trajectory alignment method: tool-call signatures are embedded with a sentence encoder, and similarity-based bucketing combined with the Hungarian algorithm yields auditable, one-to-one tool-call matching that decouples semantic fidelity from procedural consistency. Execution traces are validated by an integrated executor and a four-LLM adjudication panel.
Contribution/Results: Experiments reveal significant bottlenecks in current multimodal LLMs in argument fidelity and structural consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs.
📝 Abstract
We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary ensemble of four large language model (LLM) judges reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structural consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench.
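To make the alignment step concrete, here is a minimal sketch of similarity-bucketed Hungarian matching as described above. It is an illustrative assumption, not the paper's implementation: the function name, the cosine-similarity bucket threshold, and the use of SciPy's `linear_sum_assignment` are all choices made for this sketch; the embeddings are assumed to come from a sentence encoder applied to serialized tool-call signatures.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_tool_calls(pred_embs, ref_embs, bucket_threshold=0.5):
    """One-to-one alignment of predicted vs. reference tool calls.

    pred_embs, ref_embs: (n, d) and (m, d) arrays of L2-normalized
    tool-signature embeddings (e.g., from a sentence encoder).
    Returns (pred_idx, ref_idx, similarity) triples; pairs whose cosine
    similarity falls below bucket_threshold are dropped, emulating the
    bucketing that filters out implausible matches before assignment.
    """
    sim = pred_embs @ ref_embs.T              # cosine similarity matrix
    cost = 1.0 - sim                          # Hungarian minimizes cost
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return [(int(r), int(c), float(sim[r, c]))
            for r, c in zip(rows, cols)
            if sim[r, c] >= bucket_threshold]
```

Because the assignment is one-to-one and auditable, downstream metrics can score matched pairs for argument fidelity while treating unmatched calls as structural errors, which is how semantic and procedural scores stay decoupled.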