🤖 AI Summary
This study investigates the capacity of mainstream large language models (LLMs)—including GPT-4, Claude-3, and Llama-3—to jointly solve 6×6 Sudoku puzzles and generate strategic, stepwise, human-interpretable natural language explanations, focusing on explainability rather than mere answer correctness.
Method: We conduct zero-shot and few-shot prompting experiments, evaluating both solution accuracy and explanation quality via human assessment and logical consistency analysis.
Contribution/Results: Only one model demonstrates baseline puzzle-solving capability; none reliably produce explanations reflecting heuristic strategies, incremental reasoning, or cognitive accessibility. To our knowledge, this is the first empirical study to rigorously assess explanation quality—specifically, strategic interpretability—in structured reasoning tasks. Our findings expose a fundamental limitation in current LLMs’ ability to articulate deliberate, pedagogically sound reasoning processes. The work establishes a novel evaluation benchmark for trustworthy human-AI collaborative decision-making, emphasizing transparency, strategy awareness, and explanatory fidelity over output correctness alone.
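The summary above describes the evaluation protocol only at a high level. As a rough illustration, the minimal Python sketch below shows how a zero-shot/few-shot prompting harness and a simple solution-accuracy metric for 6×6 puzzles might be wired up. The function names (`query_model`, `build_few_shot_prompt`, `grid_accuracy`) and the prompt wording are hypothetical and not taken from the paper; explanation quality would still be judged separately by human raters and logical-consistency checks.

```python
def format_grid(grid):
    """Render a 6x6 Sudoku grid as text, using '.' for empty cells (0)."""
    return "\n".join(
        " ".join(str(c) if c != 0 else "." for c in row) for row in grid
    )

# Zero-shot prompt: no worked examples, just the task description and the puzzle.
ZERO_SHOT_TEMPLATE = (
    "Solve this 6x6 Sudoku. Each row, column, and 2x3 box must contain the digits "
    "1-6 exactly once. Explain your reasoning step by step, then give the "
    "completed grid.\n\n{puzzle}"
)

def build_few_shot_prompt(examples, puzzle):
    """Few-shot prompt: prepend worked puzzle/solution pairs before the target puzzle."""
    parts = []
    for ex in examples:
        parts.append("Puzzle:\n" + format_grid(ex["puzzle"]))
        parts.append("Solution:\n" + format_grid(ex["solution"]))
    parts.append("Puzzle:\n" + format_grid(puzzle) + "\nSolution:")
    return "\n\n".join(parts)

def query_model(prompt: str) -> str:
    """Stand-in for a chat-completion call to GPT-4, Claude-3, Llama-3, etc.
    (hypothetical; swap in whichever client the evaluated model exposes)."""
    raise NotImplementedError

def grid_accuracy(predicted, solution):
    """Solution-accuracy metric: fraction of cells matching the reference grid."""
    pairs = [
        (p, s)
        for prow, srow in zip(predicted, solution)
        for p, s in zip(prow, srow)
    ]
    return sum(p == s for p, s in pairs) / len(pairs)
```

Under these assumptions, a zero-shot run would send `ZERO_SHOT_TEMPLATE.format(puzzle=format_grid(puzzle))` to the model, parse the returned grid, and score it with `grid_accuracy`; the few-shot condition swaps in `build_few_shot_prompt` with a handful of worked examples.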
📝 Abstract
The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6×6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.