🤖 AI Summary
Existing LLM reasoning benchmarks rely heavily on pattern memorization and fail to rigorously assess genuine creative, long-horizon logical reasoning. Method: We introduce Sudoku-Bench, a benchmark explicitly designed to evaluate creative reasoning, featuring unconventional, multi-step Sudoku variants that demand strategic insight beyond rote deduction, thereby mitigating memorization bias and probing models’ ability to reason under dynamic constraints. Contribution/Results: We identify the logical breakthrough point (the “break-in”) as a core evaluation concept, systematically leveraging both the structural commonalities across Sudoku variants and their constraint-specific uniqueness. The benchmark provides standardized textual encodings, a constraint parser, and an automated evaluation pipeline compatible with mainstream LLM APIs, supporting zero-shot, few-shot, and chain-of-thought evaluation. Experiments reveal that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, exposing critical bottlenecks in long-horizon planning and constraint-coordinated reasoning and establishing a quantifiable benchmark for next-generation reasoning models.
📝 Abstract
Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs (“break-ins”). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles -- making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
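To make the idea of a "standardized text-based puzzle representation" and automated constraint checking concrete, here is a minimal sketch in Python. The encoding below (one row per line, `.` for empty cells) and the helper names (`parse`, `violations`, `all_units`) are illustrative assumptions, not the actual Sudoku-Bench format, and it checks only classic Sudoku rules, not variant constraints.

```python
# Hypothetical text encoding of a classic Sudoku grid: one row per line,
# '.' marks an empty cell. The real Sudoku-Bench encoding may differ.
GRID = "\n".join([
    "53..7....",
    "6..195...",
    ".98....6.",
    "8...6...3",
    "4..8.3..1",
    "7...2...6",
    ".6....28.",
    "...419..5",
    "....8..79",
])

def parse(text):
    """Parse a 9x9 grid into a list of rows of ints (0 = empty)."""
    return [[0 if ch == "." else int(ch) for ch in line]
            for line in text.splitlines()]

def all_units():
    """Return the 27 classic units: 9 rows, 9 columns, 9 boxes."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [[(br + r, bc + c) for r in range(3) for c in range(3)]
             for br in range(0, 9, 3) for bc in range(0, 9, 3)]
    return rows + cols + boxes

def violations(grid):
    """Count duplicate filled digits within any row, column, or box."""
    count = 0
    for unit in all_units():
        digits = [grid[r][c] for r, c in unit if grid[r][c]]
        count += len(digits) - len(set(digits))
    return count

print(violations(parse(GRID)))  # → 0 (the partial grid is consistent)
```

A variant-aware checker would extend `all_units` (or add per-variant constraint functions) for rules such as thermometers or killer cages, which is what makes a shared compact representation useful across thousands of puzzles.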