🤖 AI Summary
Large language models (LLMs) frequently exhibit “reasoning overconfidence” in multi-solution tasks, overlooking viable solutions because they cannot assess whether their answer set is complete. Method: We propose the “cognitive rigidity” hypothesis and introduce MuSoBench—the first benchmark explicitly designed for multi-solution reasoning. We evaluate short chain-of-thought (Short-CoT) and long chain-of-thought (Long-CoT) prompting strategies, using attention entropy to quantify cognitive limitations during inference. Contribution/Results: Empirical evaluation shows that Long-CoT significantly mitigates overconfidence through iterative exploration, improving solution diversity and completeness—achieving a +32.7% gain in coverage. This work shifts LLM evaluation paradigms from correctness-centric to completeness-aware assessment, providing a novel theoretical framework, a dedicated benchmark (MuSoBench), and a new methodology for modeling and evaluating multi-solution reasoning.
📝 Abstract
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to **reasoning overconfidence**: a tendency to express undue certainty in an incomplete solution set. To examine this effect, we introduce *MuSoBench*, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the **cognitive-rigidity hypothesis**, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
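The attention-entropy analysis mentioned above rests on a standard quantity: the Shannon entropy of an attention distribution, which is low when attention concentrates on a few tokens (the premature convergence the cognitive-rigidity hypothesis describes) and high when attention is spread broadly. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch; the function name and example weights are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of a single attention distribution.

    `weights` is a list of non-negative attention scores (e.g. one head
    at one decoding step); they are normalised here, so raw scores are
    fine. Lower entropy = attention concentrated on fewer tokens, the
    kind of narrowing the cognitive-rigidity hypothesis associates
    with overconfident, incomplete reasoning.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

# A sharply peaked distribution yields low entropy,
# while a uniform distribution over n tokens yields the maximum, log(n).
peaked = attention_entropy([0.97, 0.01, 0.01, 0.01])
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])
```

Tracking how this quantity evolves over the reasoning trace is one way to operationalise "premature convergence": a rapid, sustained drop in entropy would indicate the model has locked onto a narrow set of thought paths.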