Reasoning Capabilities and Invariability of Large Language Models

📅 2025-05-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses the strong prompt dependency and limited robustness of large language models (LLMs) on shallow logical reasoning tasks by introducing GEO-REASON, presented as the first cognitive psychology-driven geometric reasoning benchmark, built to control for prior-knowledge interference. Methodologically, it systematically evaluates 24 models of varying scales under zero-shot, few-shot, and chain-of-thought (CoT) prompting, and further analyzes whether eliciting the rationale before or after the answer changes CoT's effect. Key contributions include: (1) a model-dependent CoT reversal phenomenon, in which CoT degrades performance by up to 32% for certain models; (2) evidence that even 70B+ parameter models, although strongest in the zero-shot setting, remain substantially below human baselines; and (3) a cognitively aligned, structured evaluation paradigm for assessing LLM reasoning capabilities.

📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities in manipulating natural language across multiple applications, but their ability to handle simple reasoning tasks is often questioned. In this work, we aim to provide a comprehensive analysis of LLMs' reasoning competence, specifically focusing on their prompt dependency. In particular, we introduce a new benchmark dataset with a series of simple reasoning questions demanding shallow logical reasoning. Aligned with cognitive psychology standards, the questions are confined to a basic domain revolving around geometric figures, ensuring that responses are independent of any pre-existing intuition about the world and rely solely on deduction. An empirical analysis involving zero-shot and few-shot prompting across 24 LLMs of different sizes reveals that, while LLMs with over 70 billion parameters perform better in the zero-shot setting, there is still considerable room for improvement. An additional test with chain-of-thought prompting over 22 LLMs shows that this prompting strategy can aid or harm model performance, depending on whether the rationale is required before or after the answer.
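To make the prompting regimes concrete, the sketch below illustrates how the prompt styles described in the abstract might be constructed. This is an illustration only: the template wording and the few-shot example format are hypothetical placeholders, not the paper's actual prompts.

```python
# Illustrative prompt templates for the prompting regimes described in the
# abstract. All wording and the few-shot example format are hypothetical
# placeholders, not the paper's actual prompts.

def zero_shot(question: str) -> str:
    # Zero-shot: the model sees only the question.
    return f"Question: {question}\nAnswer:"

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: a handful of solved examples precede the target question.
    demos = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{demos}\nQuestion: {question}\nAnswer:"

def cot_rationale_first(question: str) -> str:
    # CoT with the rationale requested before the answer.
    return f"Question: {question}\nThink step by step, then state the final answer."

def cot_answer_first(question: str) -> str:
    # CoT with the answer requested before the rationale; per the abstract,
    # this ordering can harm rather than help some models.
    return f"Question: {question}\nState the final answer first, then explain your reasoning."
```

The last two variants differ only in where the rationale appears relative to the answer; the abstract's finding is that this ordering alone can determine whether CoT helps or hurts a given model.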
Problem

Research questions and friction points this paper is trying to address.

Analyzing LLMs' reasoning competence and prompt dependency
Evaluating performance on simple logical reasoning tasks
Assessing impact of chain-of-thought prompting on LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces benchmark dataset for shallow logical reasoning
Analyzes 24 LLMs using zero-shot and few-shot prompting
Tests chain-of-thought prompting impact on 22 LLMs (see the sketch below)
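A minimal sketch of the evaluation harness such a comparison implies is shown below, assuming a generic `query_model` inference call and a simple answer-containment scoring rule; the model names, scoring rule, and dataset format are assumptions for illustration, not details from the paper.

```python
# Hypothetical harness comparing prompting regimes across models.
# query_model is a stand-in for a real inference API (an assumption,
# not something specified by the paper).

def query_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("replace with an actual inference call")

def accuracy(model_name: str, dataset: list[dict], make_prompt) -> float:
    # Each dataset item is assumed to look like {"question": str, "answer": str}.
    correct = 0
    for item in dataset:
        reply = query_model(model_name, make_prompt(item["question"]))
        correct += item["answer"].strip().lower() in reply.lower()
    return correct / len(dataset)

# Example grid over models and prompt styles (names are placeholders),
# reusing the template functions from the earlier sketch:
# for model_name in ["model-7b", "model-70b"]:
#     for label, template in [("zero-shot", zero_shot), ("cot", cot_rationale_first)]:
#         print(model_name, label, accuracy(model_name, DATASET, template))
```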