CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

πŸ“… 2026-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current evaluations of large language models (LLMs) are often confined to isolated cognitive tasks or artificially constructed puzzles, and fail to adequately assess creative problem-solving in real-world scenarios. This work proposes a comprehensive benchmark grounded in real-world knowledge that requires models to integrate logical reasoning, lateral thinking, analogical reasoning, and commonsense knowledge to uncover non-obvious connections across diverse factual domains. Systematic evaluation of frontier thinking and non-thinking LLMs reveals a significant performance gap of up to 17% between factual and creative question answering: while models can effectively retrieve the relevant knowledge, they struggle to forge cross-domain creative links. The benchmark addresses critical gaps in existing evaluations by emphasizing both realism and integrative cognitive demands.
πŸ“ Abstract
Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a 17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
Problem

Research questions and friction points this paper is trying to address.

creative problem-solving
large language models
real-world knowledge
benchmarking
cognitive abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

creative problem-solving
real-world knowledge
large language models
reasoning benchmark
lateral thinking