🤖 AI Summary
Existing LLM evaluation overemphasizes accuracy and reveals little about the reasoning strategies models actually employ. Method: We introduce the first benchmark of riddles written in long narrative form that centers on solution-strategy diversity, specifically contrasting creative insight with brute-force enumeration, and augment it with semantic parsing, mathematical formalization, stepwise reasoning generation, gold-answer-guided self-correction, and prompt-driven reasoning analysis. Contribution/Results: Experiments show that LLMs can generate concise, human-like, insight-based solutions, demonstrating partial high-level problem-solving capability; however, they also exhibit pervasive redundancy and a bias toward enumeration, exposing flaws in their strategic preferences. Our framework moves beyond the accuracy-only paradigm, offering an interpretable, multi-layered evaluation methodology for probing how LLMs actually reason, covering solution quality, semantic comprehension, self-correction efficacy, and prompt-utilization efficiency.
📝 Abstract
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well suited to this goal because they can be solved in multiple ways, such as a few-step solution that relies on a creative insight or a longer solution that relies on brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We analyze many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise, mathematical-competition-style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems creatively. Nonetheless, there remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improving the reasoning abilities of LLMs.