🤖 AI Summary
This paper identifies a pervasive "underthinking" phenomenon in o1-like large language models during complex mathematical reasoning: the models switch between reasoning thoughts too frequently, leaving each promising path insufficiently explored and degrading accuracy. To address this, the paper formally defines and quantifies underthinking for the first time via a token-efficiency metric on incorrect responses, and proposes TIP, a fine-tuning-free decoding strategy with a thought-switching penalty that discourages premature transitions during generation, thereby enforcing deeper exploration within each reasoning path. Evaluated across multiple challenging mathematical benchmarks (e.g., MATH, AMC23), TIP achieves significant accuracy gains (+4.2% on average) and improved robustness while optimizing the depth-accuracy trade-off. The approach offers a lightweight, general-purpose, and interpretable paradigm for enhancing deep reasoning in foundation models, without architectural modification or parameter updates.
📝 Abstract
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought-switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
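The decoding idea above can be sketched as a logit adjustment at each step. The following is a minimal illustrative sketch, not the paper's actual implementation: the token ids standing in for thought-switching cues (e.g., "Alternatively"), the penalty strength `alpha`, and the penalty window `beta` are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical ids of tokens that signal a thought switch (e.g., "Alternatively").
SWITCH_TOKEN_IDS = frozenset({11, 42})

def tip_adjust_logits(logits, steps_in_current_thought,
                      switch_ids=SWITCH_TOKEN_IDS, alpha=3.0, beta=10):
    """Apply a thought-switching penalty to next-token logits.

    Subtracts `alpha` from the logits of thought-switching tokens while the
    current thought is still young (fewer than `beta` decoding steps),
    discouraging premature transitions and encouraging deeper exploration
    of the current reasoning path.
    """
    logits = np.asarray(logits, dtype=float).copy()
    if steps_in_current_thought < beta:
        for tid in switch_ids:
            logits[tid] -= alpha
    return logits
```

In a real decoder this would run once per generated token, with the step counter reset whenever a thought-switch token is actually emitted; once the penalty window `beta` has elapsed, switching is left unpenalized.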