ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) possess genuine strategic reasoning capabilities or merely rely on pattern matching from training data. Method: We introduce ChessArena, the first strategic reasoning benchmark platform tailored for LLMs, featuring a chess environment simulator, multi-agent competition, a rule-constrained inference engine, a structured puzzle suite, and automated ranking algorithms. It supports multimodal gameplay and fine-grained capability decomposition (e.g., long-horizon planning, rule internalization, multi-turn memory). Contribution/Results: Evaluating 13 mainstream LLMs across 800+ games, we find most underperform the amateur-level engine Maia-1100, and some even fall below random play. Our fine-tuned Qwen3-8B shows marked improvement, approaching the performance of much larger state-of-the-art reasoning models. This work provides the first systematic characterization of LLMs' fundamental limitations in strategic reasoning and establishes a reproducible, scalable evaluation paradigm.

📝 Abstract
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess demands complex strategic reasoning, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework in which LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and it can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Thirteen LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), and some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
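The abstract does not specify which ranking algorithm the leaderboard uses. A standard choice for chess-style head-to-head play is an Elo update; the sketch below is illustrative only (the K-factor of 32 and the initial rating of 1500 are assumptions, not values from the paper):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update Elo ratings after one game.

    score_a is the result for player A: 1.0 win, 0.5 draw, 0.0 loss.
    Returns the new (rating_a, rating_b) pair.
    """
    # Expected score for A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Example: two models both start at 1500 and A wins the game.
r_a, r_b = elo_update(1500, 1500, 1.0)
# With equal ratings, expected_a = 0.5, so A gains k/2 = 16 points.
```

Repeating this update over many pairwise games (as in the 800+ games reported here) converges to a relative strength ordering of the models.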
Problem

Research questions and friction points this paper is trying to address.

Evaluating strategic reasoning capabilities of large language models
Assessing chess-based long-term planning and rule comprehension
Testing LLM performance in competitive multi-turn game environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chess testbed evaluates strategic reasoning capabilities
Competitive framework with four different play modes
Fine-tuned Qwen3-8B model improves chess performance
👥 Authors

Jincheng Liu
State Key Laboratory of Novel Software Technology, Nanjing University
Sijun He
Baidu, Inc.
Jingjing Wu
Baidu, Inc.
Xiangsen Wang
Baidu, Inc.
Yang Chen
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC)
Zhaoqi Kuang
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC)
Siqi Bao
Baidu, Inc.
Yuan Yao
State Key Laboratory of Novel Software Technology, Nanjing University