🤖 AI Summary
Existing LLM reasoning benchmarks rely heavily on pattern memorization and fail to rigorously assess genuine creative, long-horizon logical reasoning. Method: We introduce Sudoku-Bench, a benchmark explicitly designed to evaluate creative reasoning, featuring unconventional, multi-step Sudoku variants that demand strategic insight beyond rote deduction, thereby mitigating memorization bias and probing models’ ability to reason under dynamic constraints. Contribution/Results: We identify the logical breakthrough point (the “break-in”) as a core evaluation concept, systematically leveraging both the structural commonalities across Sudoku variants and their constraint-specific uniqueness. The benchmark provides standardized textual encodings, a constraint parser, and an automated evaluation pipeline compatible with mainstream LLM APIs, supporting zero-shot, few-shot, and chain-of-thought evaluation. Experiments reveal that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, exposing critical bottlenecks in long-horizon planning and constraint-coordinated reasoning and establishing a quantifiable benchmark for next-generation reasoning models.
📝 Abstract
Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs (“break-ins”). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles -- making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
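To make the idea of a "standardized text-based puzzle representation" and automated constraint checking concrete, here is a minimal sketch in Python. The encoding below (one row per line, `.` for empty cells) and the helper names (`parse`, `violations`, `all_units`) are illustrative assumptions, not the actual Sudoku-Bench format, and it checks only classic Sudoku rules, not variant constraints.

```python
# Hypothetical text encoding of a classic Sudoku grid: one row per line,
# '.' marks an empty cell. The real Sudoku-Bench encoding may differ.
GRID = "\n".join([
    "53..7....",
    "6..195...",
    ".98....6.",
    "8...6...3",
    "4..8.3..1",
    "7...2...6",
    ".6....28.",
    "...419..5",
    "....8..79",
])

def parse(text):
    """Parse a 9x9 grid into a list of rows of ints (0 = empty)."""
    return [[0 if ch == "." else int(ch) for ch in line]
            for line in text.splitlines()]

def all_units():
    """Return the 27 classic units: 9 rows, 9 columns, 9 boxes."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [[(br + r, bc + c) for r in range(3) for c in range(3)]
             for br in range(0, 9, 3) for bc in range(0, 9, 3)]
    return rows + cols + boxes

def violations(grid):
    """Count duplicate filled digits within any row, column, or box."""
    count = 0
    for unit in all_units():
        digits = [grid[r][c] for r, c in unit if grid[r][c]]
        count += len(digits) - len(set(digits))
    return count

print(violations(parse(GRID)))  # → 0 (the partial grid is consistent)
```

A variant-aware checker would extend `all_units` (or add per-variant constraint functions) for rules such as thermometers or killer cages, which is what makes a shared compact representation useful across thousands of puzzles.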