Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the strategic decision-making capabilities of large language models (LLMs) in board games grounded in game-theoretic principles. Method: We introduce the first LLM evaluation framework tailored for game-theoretic scenarios, integrated with OpenSpiel to support diverse zero-sum and non-zero-sum games. We conduct multi-dimensional comparisons between LLM agents, random policies, human players, and reinforcement learning agents. Our pipeline combines LiteLLM API abstraction with vLLM-based local inference, leverages Ray for distributed execution, and incorporates a novel interpretable reasoning trace analyzer. Contribution/Results: We provide the first empirical assessment of LLMs’ strategic rationality, counterfactual reasoning, and equilibrium-seeking behavior under rigorous game-theoretic conditions. Our findings reveal both strengths and fundamental limitations of LLMs in complex strategic interactions, establishing a foundational theoretical basis and benchmark for their trustworthy deployment in high-stakes, higher-order decision-making tasks.

📝 Abstract
The Board Game Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in Google's OpenSpiel library. The framework enables systematic comparisons between LLM-based agents and other agent types (random, human, reinforcement learning, etc.) across a variety of game scenarios by wrapping multiple board and matrix games. It integrates API access to models via LiteLLM, supports local model deployment via vLLM, and offers distributed execution through Ray. Additionally, it provides extensive analysis tools for LLM reasoning traces. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behavior.
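The agent-vs-agent comparison pattern the abstract describes can be sketched in a few lines. The following is a minimal, stdlib-only illustration using a toy matching-pennies matrix game; the function and agent names are hypothetical and do not reflect the repository's actual API (which wraps OpenSpiel games and calls real models via LiteLLM or vLLM).

```python
import random

# Toy matching-pennies payoff for the row player. This is an
# illustrative sketch, not the Board Game Arena API: the real
# framework wraps OpenSpiel games and queries actual LLMs.
PAYOFF = {("H", "H"): 1, ("T", "T"): 1, ("H", "T"): -1, ("T", "H"): -1}

def random_agent(_history):
    """Baseline policy: pick an action uniformly at random."""
    return random.choice(["H", "T"])

def stub_llm_agent(history):
    """Stand-in for an LLM agent; a real one would prompt a model
    with the game state and parse the chosen action from its reply."""
    # Naive strategy: copy the opponent's last move, else open with "H".
    return history[-1][1] if history else "H"

def play_match(row_agent, col_agent, rounds=1000):
    """Run repeated rounds and return the row player's average payoff."""
    history, total = [], 0
    for _ in range(rounds):
        a, b = row_agent(history), col_agent(history)
        total += PAYOFF[(a, b)]
        history.append((a, b))
    return total / rounds

avg = play_match(stub_llm_agent, random_agent)
```

Swapping in different `row_agent`/`col_agent` callables is what makes the systematic comparisons (LLM vs. random vs. RL vs. human) possible; distributing many such matches is where Ray comes in.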
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM decision-making via strategic board games
Comparing LLM agents with diverse opponents systematically
Providing tools for analyzing LLM reasoning in games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for LLM evaluation via board games
Integrates API, local, and distributed execution
Provides analysis tools for reasoning traces
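One ingredient of reasoning-trace analysis is recovering the model's declared move from free-form text before scoring it. The snippet below is a hypothetical sketch of that step; the `ACTION:` reply format and all names are assumptions for illustration, not the repository's actual trace-analyzer interface.

```python
import re

# Hypothetical trace-parsing step: pull the final declared action
# out of a model's free-form reply and validate it against the
# game's legal actions. The reply format is an assumption.
ACTION_RE = re.compile(r"ACTION:\s*(\w+)", re.IGNORECASE)

def extract_action(trace, legal_actions):
    """Return the declared action if present and legal, else None."""
    match = ACTION_RE.search(trace)
    if not match:
        return None
    action = match.group(1)
    return action if action in legal_actions else None

trace = "The opponent played H last round, so I switch.\nACTION: T"
extract_action(trace, {"H", "T"})  # -> "T"
```

Returning `None` for missing or illegal actions lets the evaluation pipeline log the failure separately from a legal-but-bad move, which matters when measuring strategic rationality.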
Lucia Cipolina-Kun
LAION; Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ)
Marianna Nezhurina
LAION; Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ)
Jenia Jitsev
Scalable Learning & Multi-Purpose AI (SLAMPAI) Lab, JSC, Forschungszentrum Juelich; ELLIS; LAION
Open Foundation Models & Datasets
Scaling laws
Plasticity and Learning in Neural Networks