🤖 AI Summary
This work addresses multi-agent collaboration under strict partial observability, where agents must jointly construct 3D structures through natural language communication, a task that no single agent can complete alone. The authors introduce CRAFT, a multi-agent benchmark, and formalize the problem as a multi-sender pragmatic reasoning task. They also provide a diagnostic framework that attributes collaborative failures to three error categories: spatial grounding, belief modeling, and pragmatic communication. In large language model (LLM)-based simulations involving 15 models (8 open-weight and 7 frontier), the study finds that stronger individual reasoning ability does not reliably translate to better collaborative performance. Notably, smaller open-weight models often match or outperform frontier systems, which suggests that robust multi-agent coordination remains an unsolved challenge for current LLMs.
📝 Abstract
We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, along with a taxonomy of behavioral failure profiles for both frontier and open-weight models. Across a diverse set of 15 models (8 open-weight and 7 frontier, including reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code is available at https://github.com/csu-signal/CRAFT