LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess

📅 2025-12-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the insufficient evaluation of large language models' (LLMs) generalization in reasoning and instruction-following. To this end, we propose LLM CHESS, a dynamic evaluation framework that probes LLMs through extended agentic interaction in chess. It incorporates randomized opponents, variable-difficulty engine matches, and behavioral trajectory analysis to mitigate the overfitting and saturation inherent in static benchmarks. Using multidimensional metrics, including win rate, move quality, move legality, hallucinated actions, and Elo estimation, we systematically evaluate over 50 mainstream models. Results reveal that even state-of-the-art models struggle to consistently complete games or achieve victory. We publicly release the evaluation framework, a real-time leaderboard, and a high-quality chess game dataset, establishing a reproducible, scalable paradigm for assessing structured reasoning in LLMs across complex, rule-governed domains.

📝 Abstract
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open- and closed-source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.
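The abstract's Elo estimate can be understood through the standard logistic Elo model: a player's expected score against an opponent depends only on the rating gap. A minimal sketch of how one might fit a single rating from scores against engines of known strength is below; the paper's exact fitting procedure is not specified here, so the bisection approach, the `estimate_elo` function, and the sample data are all illustrative assumptions.

```python
import math


def expected_score(r_model: float, r_opponent: float) -> float:
    """Standard logistic Elo expectation: P(win) + 0.5 * P(draw)."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_model) / 400.0))


def estimate_elo(results, lo=0.0, hi=4000.0, iters=60):
    """Find the rating at which total expected score matches the total
    observed score across engines of known strength, via bisection.

    `results` is a list of (engine_elo, actual_score, n_games) tuples,
    where actual_score counts a win as 1 and a draw as 0.5.
    """
    total_actual = sum(score for _, score, _ in results)

    def surplus(r):
        # Expected total score at candidate rating r, minus observed total.
        return sum(n * expected_score(r, e) for e, _, n in results) - total_actual

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if surplus(mid) < 0.0:
            lo = mid  # expectation too low: true rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical results: 10 games each vs. engines configured at 1200/1400/1600.
games = [(1200, 8.0, 10), (1400, 5.0, 10), (1600, 2.0, 10)]
print(round(estimate_elo(games)))  # → 1400
```

Because `surplus` is monotonically increasing in the candidate rating, bisection is guaranteed to converge; the symmetric sample scores above pin the estimate at the middle engine's rating.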
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM reasoning and instruction-following via chess interactions
Ranks models using behavioral metrics and Elo ratings against engines
Reduces overfitting with dynamic tasks, revealing reasoning model limitations
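The behavioral metrics named above (win/loss rate, move legality, hallucinated actions, game completion) are straightforward to aggregate from per-game logs. The sketch below assumes a hypothetical `GameRecord` log format; the field names and the rate definitions (per-move for legality and hallucination, per-game for outcomes) are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass
class GameRecord:
    # Hypothetical per-game log entry; field names are illustrative.
    result: str            # "win", "loss", or "draw", from the LLM's side
    moves_played: int      # moves the LLM actually made
    illegal_attempts: int  # proposed moves rejected as illegal
    hallucinations: int    # responses that were not a move at all
    completed: bool        # did the game reach a terminal state?


def summarize(games):
    """Aggregate behavioral metrics over a batch of game records."""
    n = len(games)
    total_moves = sum(g.moves_played for g in games)
    return {
        "win_rate": sum(g.result == "win" for g in games) / n,
        "loss_rate": sum(g.result == "loss" for g in games) / n,
        "completion_rate": sum(g.completed for g in games) / n,
        "illegal_move_rate": sum(g.illegal_attempts for g in games) / total_moves,
        "hallucination_rate": sum(g.hallucinations for g in games) / total_moves,
    }


# Toy batch: one win, one draw, one abandoned game scored as a loss.
sample = [
    GameRecord("win", 40, 1, 0, True),
    GameRecord("draw", 60, 0, 2, True),
    GameRecord("loss", 20, 3, 1, False),
]
print(summarize(sample))
```

Separating per-game outcome rates from per-move error rates matters here: a model can post a decent win rate against a random opponent while still showing a nonzero illegal-move or hallucination rate, which is exactly the kind of instruction-following failure the benchmark is designed to expose.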
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs via extended chess agentic interactions
Ranking models using behavioral metrics and Elo estimates
Reducing overfitting with stochastic dynamic evaluation framework