COMMA: A Communicative Multimodal Multi-Agent Benchmark

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks overlook the evaluation of multimodal agents’ capabilities in language-driven collaboration, particularly lacking systematic assessment of agent–agent and agent–human coordination under information asymmetry. Method: We introduce COMMA, a multimodal multi-agent benchmark designed for language-mediated collaboration, featuring vision-language inputs, asymmetric-information tasks, and a structured evaluation protocol. Contribution/Results: We evaluate collaborative competence across four categories of agentic capability and find that state-of-the-art models, including proprietary models like GPT-4o, struggle to outperform even a simple random-agent baseline in agent–agent collaboration, surpassing it only when a human partner is involved. These findings expose a fundamental deficiency in the autonomous, language-mediated coordination capabilities of current multimodal large language models.

📝 Abstract
The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. Some of these models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal multi-agent collaboration
Assesses language-based communication effectiveness
Identifies weaknesses in current agent models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal multi-agent benchmark
Language-based communication evaluation
Agent-agent and agent-human collaboration
Timothy Ossowski
Department of Computer Science, University of Wisconsin, Madison
Jixuan Chen
UC San Diego
Multimodal agents · Natural language processing · Machine learning
Danyal Maqbool
Department of Computer Science, University of Wisconsin, Madison
Zefan Cai
Student, Peking University
Inference Acceleration · Multi-Modality
Tyler J. Bradshaw
Associate Professor, University of Wisconsin - Madison
Machine learning · nuclear medicine · large language models · multimodal vision-language
Junjie Hu
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison