🤖 AI Summary
Existing LLM-based agent benchmarks predominantly focus on single-agent or narrow-domain tasks, lacking systematic evaluation of collaborative and competitive dynamics. Method: We introduce the first comprehensive benchmark specifically designed to assess LLM-based multi-agent collaboration and competition capabilities, covering diverse interaction scenarios. Evaluation is conducted along two dimensions: task completion and quality of collaboration/competition. We propose a novel milestone-based KPI evaluation framework supporting multiple coordination topologies—including star, chain, tree, and graph structures—and incorporate new strategies: group discussion and cognitive planning. Contribution/Results: Experiments show that gpt-4o-mini achieves the highest average task score. Graph-structured coordination yields the best performance in research-oriented scenarios. Cognitive planning improves the milestone achievement rate by 3%, demonstrating its efficacy in enhancing goal-directed multi-agent behavior.
📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
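The milestone achievement rate mentioned above can be read as a simple ratio of milestones reached to milestones defined for a task. A minimal illustrative sketch follows; the function name and example numbers are assumptions for clarity, not the benchmark's actual API or reported data:

```python
def milestone_achievement_rate(milestones_hit: int, milestones_total: int) -> float:
    """Fraction of a task's predefined milestones that the agent team reached.

    Illustrative only: MultiAgentBench's real KPI framework also scores
    collaboration/competition quality, which this sketch omits.
    """
    if milestones_total == 0:
        return 0.0  # avoid division by zero for tasks with no milestones
    return milestones_hit / milestones_total


# Hypothetical comparison of a baseline run vs. one using cognitive planning
# (the numbers are made up for illustration, not results from the paper).
baseline = milestone_achievement_rate(14, 20)       # 0.70
with_planning = milestone_achievement_rate(15, 20)  # 0.75

delta_pp = (with_planning - baseline) * 100
print(f"improvement: {delta_pp:.0f} percentage points")
```

A per-scenario average of this rate across tasks would then let one compare coordination topologies (star, chain, tree, graph) on equal footing.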