Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks inadequately assess the collaborative capabilities of large language models (LLMs) in multi-agent systems (LLM-MAS), particularly at the fine-grained process level. Method: We introduce Collab-Overcooked—the first benchmark explicitly designed for evaluating step-by-step collaboration—built upon Overcooked-AI to support natural-language-driven, real-time cooperative tasks. It features a novel “process-oriented” evaluation metric suite and an open-source multi-agent collaboration framework enabling diverse goals and language-mediated interaction. Contribution/Results: Systematic evaluation across 30 open-ended tasks reveals that while current LLMs exhibit strong goal comprehension, they suffer from critical bottlenecks in proactive collaboration, dynamic adaptation, and sustained coordination. This work establishes a new paradigm, toolkit, and empirical foundation for quantifying and advancing LLM collaboration capabilities.

📝 Abstract
Agent systems based on large language models (LLMs) have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 10 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant discrepancies in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLM-based multi-agent systems
Assessing collaboration in interactive environments
Developing process-oriented evaluation metrics for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-powered Multi-Agent System
natural language collaboration
process-oriented evaluation metrics
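Process-oriented metrics score how agents collaborate step by step rather than only whether the final goal is reached. A minimal sketch of what such metrics can look like; the function names and scoring formulas below are illustrative assumptions, not the paper's actual metric definitions:

```python
# Illustrative sketch of process-oriented evaluation metrics.
# NOTE: these are hypothetical stand-ins, not the metric suite
# defined in the Collab-Overcooked paper.

def trajectory_efficiency(executed_actions, reference_actions):
    """Compare an executed action trajectory against a minimal
    reference trajectory for the task.

    Returns a value in [0, 1]; 1.0 means the agents took no
    redundant steps relative to the reference plan.
    """
    if not executed_actions:
        return 0.0
    # Extra or repeated steps lower the score proportionally.
    return min(1.0, len(reference_actions) / len(executed_actions))


def task_progress(completed_subtasks, required_subtasks):
    """Fraction of required subtasks completed, so an episode can
    still be scored even when the final goal is never reached."""
    done = sum(1 for s in required_subtasks if s in completed_subtasks)
    return done / len(required_subtasks)
```

For example, a pair of agents that chops an onion twice before cooking and plating would score `trajectory_efficiency(["chop", "chop", "cook", "plate"], ["chop", "cook", "plate"]) == 0.75`, exposing wasted coordination that an outcome-only success metric would hide.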
👥 Authors
Haochen Sun
Beijing University of Posts and Telecommunications
Large Language Model, Multi-Agent System
Shuwen Zhang
Beijing University of Posts and Telecommunications
Lei Ren
Li Auto
NLP, LLM, VLM
Hao Xu
Li Auto Inc.
Hao Fu
Li Auto Inc.
Caixia Yuan
Beijing University of Posts and Telecommunications
Xiaojie Wang
Beijing University of Posts and Telecommunications