🤖 AI Summary
Large language models (LLMs) frequently exhibit “reasoning overconfidence” in multi-solution tasks, overlooking viable solutions because they cannot assess whether their answer set is complete. Method: We propose the “cognitive rigidity” hypothesis and introduce MuSoBench—the first benchmark explicitly designed for multi-solution reasoning. We evaluate short chain-of-thought (Short-CoT) and long chain-of-thought (Long-CoT) prompting strategies, using attention entropy to quantify cognitive limitations during inference. Contribution/Results: Empirical evaluation shows that Long-CoT significantly mitigates overconfidence through iterative exploration, improving solution diversity and completeness—achieving a +32.7% gain in coverage. This work shifts LLM evaluation paradigms from correctness-centric to completeness-aware assessment, providing a novel theoretical framework, a dedicated benchmark (MuSoBench), and a new methodology for modeling and evaluating multi-solution reasoning.
📝 Abstract
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to **reasoning overconfidence**: a tendency to express undue certainty in an incomplete solution set. To examine this effect, we introduce *MuSoBench*, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the **cognitive-rigidity hypothesis**, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
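The attention-entropy analysis mentioned above rests on a standard quantity: the Shannon entropy of an attention distribution, which is low when attention concentrates on a few tokens (the premature convergence the cognitive-rigidity hypothesis describes) and high when attention is spread broadly. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch; the function name and example weights are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of a single attention distribution.

    `weights` is a list of non-negative attention scores (e.g. one head
    at one decoding step); they are normalised here, so raw scores are
    fine. Lower entropy = attention concentrated on fewer tokens, the
    kind of narrowing the cognitive-rigidity hypothesis associates
    with overconfident, incomplete reasoning.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

# A sharply peaked distribution yields low entropy,
# while a uniform distribution over n tokens yields the maximum, log(n).
peaked = attention_entropy([0.97, 0.01, 0.01, 0.01])
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])
```

Tracking how this quantity evolves over the reasoning trace is one way to operationalise "premature convergence": a rapid, sustained drop in entropy would indicate the model has locked onto a narrow set of thought paths.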