Frontier LLMs Still Struggle with Simple Reasoning Tasks

📅 2025-07-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
State-of-the-art large language models (LLMs) exhibit systematic failures on reasoning tasks that humans find simple, including counting, first-order logic, proof-tree construction, and travel planning, revealing fundamental deficits in out-of-distribution generalization. Method: The authors introduce a programmatically generated reasoning benchmark with tunable computational complexity and propose "unpuzzles": trivialized variants of classic mathematical and logical puzzles. Contribution/Results: Experiments show that while LLMs perform well on the original puzzles, their performance degrades significantly on the trivialized versions, indicating overreliance on memorized patterns and statistical shortcuts rather than genuine structural reasoning; accuracy also drops sharply as task complexity increases. This work is the first to systematically identify, characterize, and quantify this "simplification-induced failure" phenomenon, establishing a novel diagnostic framework and robustness benchmark for evaluating compositional and systematic reasoning in LLMs.
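To make the benchmark construction concrete, below is a minimal sketch of a procedurally generated counting task with a tunable complexity knob, in the spirit of the paper's design. The function name, vocabulary, and parameters are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of a procedurally generated counting task with a tunable
# complexity parameter. Function name, vocabulary, and defaults are
# illustrative assumptions, not taken from the paper's codebase.
import random

def make_counting_task(num_words: int, target: str = "apple", seed: int = 0):
    """Build a 'count the occurrences' prompt whose required computation
    grows with num_words while the underlying task stays trivial."""
    rng = random.Random(seed)
    vocab = ["apple", "pear", "plum", "fig", "kiwi"]
    words = [rng.choice(vocab) for _ in range(num_words)]
    prompt = (
        f"How many times does the word '{target}' appear in the "
        f"following list?\n{' '.join(words)}"
    )
    return prompt, words.count(target)

# Doubling num_words doubles the context the model must scan, but the
# task itself (counting) is no harder conceptually.
prompt, gold_answer = make_counting_task(num_words=500)
```

Scaling `num_words` lengthens the context and the amount of computation required while leaving the underlying task exactly as easy as before, which is the "preserve fundamental difficulty, increase computation" knob the summary describes.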

📝 Abstract
While state-of-the-art large language models (LLMs) demonstrate advanced reasoning capabilities, achieving remarkable performance on challenging competitive math and coding benchmarks, they also frequently fail on tasks that are easy for humans. This work studies the performance of frontier LLMs on a broad set of such "easy" reasoning problems. By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g., statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts). To further understand the behavior of the models, we introduce the unpuzzles dataset, a different "easy" benchmark consisting of trivialized versions of well-known math and logic puzzles. Interestingly, while modern LLMs excel at solving the original puzzles, they tend to fail on the trivialized versions, exhibiting several systematic failure patterns related to memorizing the originals. We show that this happens even if the models are otherwise able to solve problems with different descriptions but requiring the same logic. Our results highlight that out-of-distribution generalization is still problematic for frontier language models and the new generation of thinking models, even for simple reasoning tasks, and making tasks easier does not necessarily imply improved performance.
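To make the "unpuzzle" idea concrete, here is a constructed illustration of an original puzzle paired with a trivialized variant. This exact pair is a hypothetical example, not claimed to be in the paper's dataset.

```python
# A constructed "unpuzzle" pair: a classic river-crossing puzzle and a
# trivialized variant whose answer is immediate. Hypothetical illustration,
# not drawn from the paper's dataset.
original = (
    "A farmer must ferry a wolf, a goat, and a cabbage across a river. "
    "The boat carries the farmer plus one item; the wolf eats the goat, "
    "and the goat eats the cabbage, if left alone together. "
    "How does the farmer get everything across?"
)
trivialized = (
    "A farmer must ferry a wolf, a goat, and a cabbage across a river. "
    "The boat carries the farmer and all three items at once. "
    "How many crossings does the farmer need?"
)
# Failure pattern of the kind the paper describes: a model that has
# memorized the original puzzle may reproduce the elaborate multi-trip
# solution instead of noticing the trivialized version needs one crossing.
```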
Problem

Research questions and friction points this paper is trying to address.

Frontier LLMs fail on reasoning tasks that are easy for humans
Models struggle with trivialized versions of known puzzles
Out-of-distribution generalization remains challenging for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Suite of procedurally generated simple reasoning tasks
Unpuzzles dataset of trivialized well-known puzzles
Analysis of systematic LLM failure patterns
👥 Authors
Alan Malek (MIT) · Machine Learning, Online Learning, Bandit Algorithms
Jiawei Ge (Department of Operations Research and Financial Engineering, Princeton University)
Chi Jin (Assistant Professor, Princeton University) · Machine Learning, Optimization
András György (Google DeepMind)
Csaba Szepesvári (Google DeepMind)