MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-agent benchmarks predominantly target single-agent or narrow-domain tasks and lack systematic evaluation of collaborative and competitive dynamics. Method: The paper introduces MultiAgentBench, a comprehensive benchmark for assessing LLM-based multi-agent collaboration and competition across diverse interaction scenarios. Evaluation covers two dimensions: task completion and the quality of collaboration/competition, measured with a milestone-based KPI framework. The benchmark supports multiple coordination protocols (star, chain, tree, and graph topologies) and strategies such as group discussion and cognitive planning. Contribution/Results: In experiments, gpt-4o-mini achieves the highest average task score, graph-structured coordination performs best in the research scenario, and cognitive planning improves milestone achievement rates by 3%.
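
The coordination protocols mentioned above can be pictured as communication graphs over the participating agents. The sketch below is purely illustrative and is not drawn from the MARBLE codebase; the function name, agent names, and edge conventions are assumptions made only to show how the star, chain, tree, and graph topologies differ.

```python
# Hypothetical sketch (not the MARBLE API): coordination topologies as
# directed communication graphs over agent ids.
from itertools import combinations

def build_topology(kind: str, agents: list[str]) -> set[tuple[str, str]]:
    """Return directed edges describing which agent may message which."""
    if kind == "star":    # a central coordinator exchanges messages with every worker
        hub, workers = agents[0], agents[1:]
        return {(hub, w) for w in workers} | {(w, hub) for w in workers}
    if kind == "chain":   # agents pass messages along a fixed order
        return {(a, b) for a, b in zip(agents, agents[1:])}
    if kind == "tree":    # each agent reports to a single parent (binary-heap layout)
        return {(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))}
    if kind == "graph":   # fully connected: any agent can contact any other
        pairs = set(combinations(agents, 2))
        return {(a, b) for a, b in pairs} | {(b, a) for a, b in pairs}
    raise ValueError(f"unknown topology: {kind}")

print(sorted(build_topology("star", ["planner", "coder", "critic"])))
```

In a star topology every message is routed through the hub agent, while the graph topology lets any pair of agents communicate directly, which is one way to read the finding that graph-structured coordination helps in research-style scenarios.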

📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
Problem

Research questions and friction points this paper is trying to address.

Existing agent benchmarks focus on single-agent tasks or narrow domains.
Collaborative and competitive dynamics among LLM agents lack systematic evaluation.
Task completion alone does not capture the quality of coordination or competition.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MultiAgentBench, a benchmark for LLM-based multi-agent systems in diverse, interactive scenarios.
Measures both task completion and collaboration/competition quality via milestone-based KPIs (see the sketch after this list).
Evaluates star, chain, tree, and graph coordination protocols plus group discussion and cognitive planning strategies.
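
As a rough illustration of the milestone-based KPI idea, the sketch below scores an episode by the fraction of predefined milestones the agent team achieves. The `Milestone` class, the milestone names, and the scoring function are hypothetical and are not taken from the paper's implementation.

```python
# Hypothetical sketch (names are illustrative, not from the paper): a
# milestone-based KPI as the fraction of predefined milestones an agent
# team verifiably reaches during an episode.
from dataclasses import dataclass

@dataclass
class Milestone:
    name: str
    achieved: bool = False

def milestone_kpi(milestones: list[Milestone]) -> float:
    """Milestone achievement rate in [0, 1]; returns 0.0 if no milestones are defined."""
    if not milestones:
        return 0.0
    return sum(m.achieved for m in milestones) / len(milestones)

episode = [
    Milestone("formulate research question", achieved=True),
    Milestone("draft experiment plan", achieved=True),
    Milestone("produce final report", achieved=False),
]
print(f"milestone rate: {milestone_kpi(episode):.2f}")  # -> milestone rate: 0.67
```

Under this reading, the reported 3% gain from cognitive planning would correspond to a higher fraction of such milestones being reached per episode.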