Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently exhibit "reasoning overconfidence" in multi-solution tasks, overlooking viable solutions because they cannot assess the completeness of their answer set. Method: We propose the "cognitive rigidity" hypothesis and introduce MuSoBench, a benchmark explicitly designed for multi-solution reasoning. We compare short chain-of-thought (Short-CoT) and long chain-of-thought (Long-CoT) prompting strategies, and use attention entropy to quantify cognitive rigidity during inference. Contribution/Results: Empirical evaluation shows that Long-CoT significantly mitigates overconfidence through iterative exploration, improving solution diversity and completeness (a +32.7% gain in coverage). This work shifts LLM evaluation from correctness-centric to completeness-aware assessment, contributing a theoretical framework, a dedicated benchmark (MuSoBench), and a methodology for modeling and evaluating multi-solution reasoning.

📝 Abstract
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
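The attention-entropy analysis mentioned in the abstract can be illustrated with a minimal sketch (hypothetical code, not the paper's implementation): the Shannon entropy of an attention distribution measures how broadly the model spreads attention over tokens, and low entropy would correspond to the premature convergence posited by the cognitive-rigidity hypothesis.

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution,
    i.e. one row of an attention matrix. Low entropy means attention
    is concentrated on a few tokens; the cognitive-rigidity hypothesis
    links such narrow focus to reasoning overconfidence."""
    return -sum(p * math.log(p) for p in attn_row if p > 0)

# Toy example: a sharply peaked distribution vs. a uniform one.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
assert attention_entropy(peaked) < attention_entropy(uniform)
```

Averaging this quantity over heads and decoding steps would give a single per-response score that could be correlated with solution coverage.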
Problem

Research questions and friction points this paper is trying to address.

Addresses overconfidence in LLMs for multi-solution tasks
Introduces a benchmark to evaluate reasoning completeness
Proposes a hypothesis on premature convergence in reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MuSoBench benchmark for multi-solution tasks
Uses Long-CoT prompting to reduce reasoning overconfidence
Proposes cognitive-rigidity hypothesis to explain overconfidence
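The Short-CoT vs. Long-CoT contrast listed above can be sketched with illustrative prompt templates (hypothetical wording, not the paper's exact prompts): Long-CoT adds explicit self-reflection that pushes the model to keep searching for further solutions before committing to an answer set.

```python
# Illustrative prompt templates (hypothetical wording, not the paper's
# exact prompts) contrasting the two strategies on a multi-solution task.
SHORT_COT = (
    "Question: {question}\n"
    "Think step by step, then give your answer."
)

LONG_COT = (
    "Question: {question}\n"
    "Think step by step. After each candidate answer, ask yourself:\n"
    "'Are there other valid solutions I have not considered?'\n"
    "Keep exploring until you are confident the solution set is complete,\n"
    "then list every solution you found."
)

question = "List all integer pairs (a, b) with a * b = 6 and a <= b."
prompt = LONG_COT.format(question=question)
```

The self-questioning line is what distinguishes the two regimes here: it makes answer-set completeness, not just correctness, part of the stopping criterion.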
Jiannan Guan
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology
Qiguang Chen
Harbin Institute of Technology
Chain-of-Thought, Reasoning, Multilingual LLM, Multi-modal LLM
Libo Qin
School of Computer Science and Engineering, Central South University
Dengyun Peng
Harbin Institute of Technology
Jinhao Liu
Harbin Institute of Technology
Chain-of-Thought, Reasoning, Natural Language Processing
Liangyu Huo
Du Xiaoman (Beijing) Science Technology Co., Ltd.
Jian Xie
Du Xiaoman (Beijing) Science Technology Co., Ltd.
Wanxiang Che
Professor, Harbin Institute of Technology
Natural Language Processing