🤖 AI Summary
Existing research on multi-agent debate (MAD) lacks a unified, cross-modal evaluation framework, making it difficult to fairly assess performance across diverse domains and modalities. To address this gap, this work proposes M3MAD-Bench—the first comprehensive MAD benchmark that spans five key domains (knowledge, mathematics, medicine, natural sciences, and complex reasoning) and supports both textual and vision-language multimodal tasks. Built upon a structured multi-agent debate framework, the benchmark enables systematic evaluation across nine heterogeneous foundation models along multiple dimensions, including accuracy, robustness, and efficiency (measured by token consumption and inference time). Experimental results reveal the effectiveness boundaries, robustness variations, and efficiency trade-offs of MAD in multimodal settings, establishing a reliable and comparable foundation for future research.
📝 Abstract
As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains (Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning) and systematically covers both pure-text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance-cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.