Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

📅 2024-06-04
🏛️ arXiv.org
📈 Citations: 19
Influential: 0
🤖 AI Summary
This work reveals a severe generalization collapse in state-of-the-art large language models (e.g., GPT-4, Claude 3 Opus) on minimalist commonsense arithmetic tasks (AIW problems): even under zero-shot settings with syntactically invariant, semantically unambiguous prompts, model accuracy falls significantly below human performance and exhibits high sensitivity to superficial template perturbations. Method: We introduce a novel, natural-language-formalized micro-benchmark, coupled with controlled template ablation experiments and multi-turn self-correction prompting, to systematically expose structural reasoning failures and hallucinatory error explanations. Contribution/Results: Empirical results demonstrate that standard benchmarks substantially underestimate fundamental reasoning deficits; canonical prompting strategies—including chain-of-thought and self-correction—fail to mitigate these flaws; and current evaluation practices suffer from systemic biases in capability assessment. This work establishes a new paradigm for rigorous LLM reasoning evaluation and issues a critical caution against overestimating compositional and deductive competence in contemporary foundation models.

📝 Abstract
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws and therefore transfer robustly across various conditions in a few- or zero-shot manner. Such claims rely on standardized benchmarks that are supposed to measure generalization and reasoning, on which state-of-the-art (SOTA) models score high. We demonstrate here a dramatic breakdown of generalization and basic reasoning in all SOTA models claiming strong function, including large-scale advanced models like GPT-4 or Claude 3 Opus, using a simple, short common-sense math problem formulated in concise natural language and easily solvable by humans (the AIW problem). The breakdown is dramatic in that it manifests, on a simple problem, as both low average performance and strong performance fluctuations under natural variations of the problem template that change neither the problem's structure nor its difficulty. By testing models on further control problems of similar form, we rule out that the breakdown is rooted in minor low-level issues such as natural-language or number parsing. We also observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. Various standard interventions aimed at obtaining the right solution, such as chain-of-thought prompting or urging the models to reconsider their wrong solutions via multi-step re-evaluation, fail. We use these observations to stimulate re-assessment of the capabilities of the current generation of LLMs as claimed by standardized benchmarks. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such deficits in generalization and reasoning, which evidently remain undiscovered by current state-of-the-art evaluation procedures, on which SOTA LLMs manage to score high. Code: https://github.com/LAION-AI/AIW
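The abstract's central object, the AIW problem, is a templated family of prompts with a fixed ground truth. A minimal sketch of how such variants can be generated and scored is below; the exact template wording, variable names, and (N, M) values here are illustrative assumptions, not the paper's verbatim prompts (those are in the linked repository). The canonical form asks how many sisters Alice's brother has, which is Alice's sisters plus Alice herself.

```python
# Sketch: generate AIW-style prompt variants and their ground truth.
# Template wording and the sample (N, M) pairs are illustrative only.

def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Render one AIW problem instance from the template."""
    return (
        f"Alice has {n_brothers} brothers and she also has "
        f"{m_sisters} sisters. How many sisters does Alice's brother have?"
    )

def aiw_answer(m_sisters: int) -> int:
    # Each brother has Alice's sisters plus Alice herself.
    return m_sisters + 1

# Illustrative template variations: structure and difficulty are unchanged,
# only the numbers vary.
variants = [(3, 6), (4, 1), (2, 4)]
for n, m in variants:
    print(aiw_prompt(n, m), "->", aiw_answer(m))
```

Scoring a model then reduces to checking whether its final numeric answer equals `aiw_answer(m)` for each sampled variant; the paper's finding is that accuracy fluctuates strongly across such superficially different instances.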
Problem

Research questions and friction points this paper is trying to address.

Demonstrates reasoning breakdown in SOTA LLMs on simple tasks.
Highlights overconfidence and failure in solving basic math problems.
Calls for re-assessment of LLM capabilities and benchmark standards.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a simple math problem to test LLM reasoning.
Identifies a generalization breakdown in SOTA models.
Highlights the need for new evaluation benchmarks.
Marianna Nezhurina
LAION, Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ), Open-Ψ(Open-Sci) Collective
Lucia Cipolina-Kun
LAION, Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ), School of Electrical and Electronic Engineering, University of Bristol
Mehdi Cherti
Postdoc at Forschungszentrum Jülich, LAION co-founder
Deep learning, Scaling laws, multi-modal models
J. Jitsev
LAION, Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ), Open-Ψ(Open-Sci) Collective