🤖 AI Summary
This study addresses how to predict the collaborative performance of multi-agent large language model (LLM) teams in resource-constrained scientific tasks. It integrates behavioral economics games with realistic AI-for-Science workflows to construct a quantifiable profile of each model's cooperative tendencies. By evaluating 35 open-weight LLMs across six behavioral economics games and validating the resulting profiles against downstream multi-agent scientific tasks, the work shows that cooperative propensity is a measurable trait distinct from general capability. Highly cooperative models outperform those that adopt greedy strategies on report accuracy, report quality, and task completion. This predictive power remains robust after controlling for multiple confounding factors.
📝 Abstract
Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared resource constraints, such as limited GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in the games and invest in multiplicative team production (rather than pursuing greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs that is not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.
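To make the contrast between cooperative and greedy play concrete, the sketch below computes payoffs in a standard linear public goods game. This is an illustrative stand-in only: the abstract does not name the six games or give their parameters, so the function name `public_goods_payoffs`, the endowment of 10, and the multiplier of 2 are all assumptions chosen for the example, not the study's setup.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs for a one-shot linear public goods game.

    Hypothetical illustration: parameter values are assumptions,
    not taken from the study.
    """
    if not contributions:
        raise ValueError("at least one player is required")
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must lie in [0, endowment]")
    # Contributions are pooled, multiplied, and split equally.
    share = multiplier * sum(contributions) / len(contributions)
    # Each player keeps what they did not contribute, plus an equal share.
    return [endowment - c + share for c in contributions]


# Four agents: full cooperation, full free-riding, and a single defector.
all_cooperate = public_goods_payoffs([10.0, 10.0, 10.0, 10.0])
all_greedy = public_goods_payoffs([0.0, 0.0, 0.0, 0.0])
one_defector = public_goods_payoffs([10.0, 10.0, 10.0, 0.0])

print(all_cooperate)  # [20.0, 20.0, 20.0, 20.0]
print(all_greedy)     # [10.0, 10.0, 10.0, 10.0]
print(one_defector)   # [15.0, 15.0, 15.0, 25.0]
```

Under these assumed parameters, the team does best when everyone contributes, yet a lone defector earns more than any contributor. That tension between individual and team payoff is the kind of cooperation mechanism such games are designed to isolate.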