How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

📅 2024-03-18
🏛️ arXiv.org
📈 Citations: 19
✨ Influential: 1
📄 PDF
🤖 AI Summary
Existing evaluation methods for large language model (LLM)-based multi-agent decision-making are limited to two-player games and suffer from test-set leakage, undermining validity and generalizability. Method: We propose GAMA(γ)-Bench, a novel benchmark framework covering eight canonical multi-agent game scenarios. It introduces the first dynamic participation mechanism with tunable parameters and an adaptive, strategy-quality-based quantitative scoring system designed to prevent data leakage. Leveraging game-theoretic modeling, dynamic scoring algorithms, and Chain-of-Thought (CoT) prompt optimization, we systematically evaluate 13 mainstream LLMs. Contribution/Results: Gemini-1.5-Pro achieves the highest score (69.8). Analyses reveal GPT-3.5's strong robustness but weak generalization, while CoT consistently enhances strategic reasoning. GAMA(γ)-Bench establishes the first scalable, leakage-resistant, fine-grained benchmark for evaluating LLMs' multi-agent decision-making capabilities.

πŸ“ Abstract
Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test-set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms others, scoring $69.8$ out of $100$, followed by LLaMA-3.1-70B ($65.9$) and Mixtral-8x22B ($62.4$). Our code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.
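To make the idea of a parameter-adaptive scoring scheme concrete, here is a minimal sketch for one classical multi-player game, "Guess 2/3 of the Average" (used here purely as an illustrative example; the function names, the distance-based formula, and the parameter choices below are assumptions for exposition, not the paper's actual implementation).

```python
# Hypothetical sketch: score an agent's move in a multi-player
# "Guess 2/3 of the Average" round on a 0-100 scale.
# All names and the scoring formula are illustrative assumptions.

def target(guesses, ratio=2 / 3):
    """Winning target: `ratio` times the mean of all players' guesses."""
    return ratio * sum(guesses) / len(guesses)

def score(guess, guesses, ratio=2 / 3, max_guess=100):
    """Map a player's distance from the target onto [0, 100].

    `ratio` and `max_guess` stand in for the tunable game settings
    that such a benchmark would adapt its scoring to: changing them
    changes both the game and the normalization consistently.
    """
    t = target(guesses, ratio)
    return 100 * (1 - abs(guess - t) / max_guess)

# Example round with ten agents, one of which guesses 30.
guesses = [30, 50, 45, 60, 20, 35, 55, 40, 25, 65]
print(round(score(30, guesses), 1))  # distance-normalized score
```

Because the score is normalized by the game's own parameters rather than by a fixed answer key, regenerating the game with different settings yields fresh, comparable scores, which is one way a dynamic scheme can resist test-set leakage.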
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' decision-making in multi-agent games
Introduce GAMA-Bench for dynamic LLM performance assessment
Assess robustness and generalizability of 13 LLMs across games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Game Theory evaluation
Dynamic scoring system adaptation
Comprehensive LLM performance assessment