🤖 AI Summary
Existing evaluations of large language models' (LLMs') chess capabilities lack systematicity and hierarchical granularity, failing to characterize performance disparities across abstraction levels such as structural understanding, tactical pattern recognition, short tactical calculation, positional evaluation, and semantic description. To address this, we propose ChessQA, a hierarchical, dynamic benchmark grounded in the rules of chess and the progression by which human players learn the game. It comprises a multi-category test suite with automated puzzle generation, canonical answer keys, and configurable prompting, enabling continuous updates as models evolve. Experiments reveal persistent weaknesses across all five task categories for mainstream LLMs in chess-specific reasoning. We open-source the evaluation framework, a public leaderboard, and regularly refreshed datasets, establishing a new paradigm for domain-specific capability assessment of LLMs.
📝 Abstract
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has a well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move-quality evaluations of prior work, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
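To illustrate how a category-structured benchmark of this kind could be consumed, the sketch below shows a minimal, hypothetical evaluation loop keyed to the five ChessQA categories. The item schema (`ChessQAItem`), the exact-match grading, and the toy model callable are assumptions for illustration only, not the released framework's API.

```python
# Hypothetical sketch of ChessQA-style items and per-category scoring.
# Category names follow the abstract; the schema and grading rule are assumptions.
from dataclasses import dataclass

CATEGORIES = ["Structural", "Motifs", "Short Tactics", "Position Judgment", "Semantic"]

@dataclass
class ChessQAItem:
    category: str    # one of CATEGORIES
    prompt: str      # e.g., a FEN plus a question
    answer_key: str  # canonical answer used for automatic grading

def score(items, model_fn):
    """Return per-category accuracy for a model callable: prompt -> answer string."""
    totals = {c: [0, 0] for c in CATEGORIES}  # category -> [correct, attempted]
    for item in items:
        prediction = model_fn(item.prompt).strip().lower()
        totals[item.category][1] += 1
        if prediction == item.answer_key.strip().lower():
            totals[item.category][0] += 1
    return {c: (correct / attempted if attempted else 0.0)
            for c, (correct, attempted) in totals.items()}

if __name__ == "__main__":
    items = [
        ChessQAItem("Structural",
                    "FEN: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1. "
                    "Which piece stands on e1?",
                    "king"),
    ]
    print(score(items, lambda prompt: "King"))  # toy model that always answers "King"
```

A real harness would presumably swap the exact-match check for category-specific graders (e.g., engine-verified move lists for Short Tactics), but the category-keyed structure above conveys the basic idea.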