🤖 AI Summary
Existing benchmarks overlook multimodal agents' capabilities in language-driven collaboration, in particular lacking any systematic assessment of agent–agent and agent–human coordination under information asymmetry. Method: We introduce LanCoop, the first multimodal multi-agent benchmark explicitly designed for language-mediated collaboration, featuring vision-language inputs, asymmetric-information tasks, and a structured evaluation protocol. Contribution/Results: We propose a four-dimensional framework for assessing collaborative competence and show empirically that state-of-the-art models, including GPT-4o, struggle to outperform even a simple random-agent baseline in pure agent–agent collaboration; they surpass the random baseline only when a human participant is involved. These results expose a fundamental deficiency in collaborative reasoning: current multimodal large language models lack robust, autonomous language-mediated coordination.
📝 Abstract
Despite rapid advances in multimodal agents built on large foundation models, their potential for language-based communication between agents in collaborative tasks has been largely overlooked. This oversight leaves a critical gap in our understanding of their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, especially in scenarios where agents have unequal access to information and must work together on tasks beyond the capabilities of any single agent. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios and provides a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Testing both agent–agent and agent–human collaboration with open-source and closed-source models, we find surprising weaknesses in state-of-the-art models, including proprietary ones such as GPT-4o: some struggle to outperform even a simple random-agent baseline in agent–agent collaboration and surpass that baseline only when a human is involved.