🤖 AI Summary
Existing research on multi-agent debate (MAD) lacks a unified, cross-modal evaluation framework, making it difficult to fairly assess performance across diverse domains and modalities. To address this gap, this work proposes M3MAD-Bench—the first comprehensive MAD benchmark that spans five key domains (knowledge, mathematics, medicine, natural sciences, and complex reasoning) and supports both textual and vision-language multimodal tasks. Built upon a structured multi-agent debate framework, the benchmark enables systematic evaluation across nine heterogeneous foundation models along multiple dimensions, including accuracy, robustness, and efficiency (measured by token consumption and inference time). Experimental results reveal the effectiveness boundaries, robustness variations, and efficiency trade-offs of MAD in multimodal settings, establishing a reliable and comparable foundation for future research.
📝 Abstract
As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains (Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning) and systematically covers both pure-text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance-cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.