🤖 AI Summary
This paper investigates whether large language models (LLMs) exhibit "overthinking", i.e., redundant, excessively long chain-of-thought (CoT) reasoning, on simple tasks, and identifies its underlying causes.
Method: We propose TRACE, a fine-grained thought trajectory analysis framework that decomposes CoT into atomic thought units and constructs a discourse-aware thought evolution graph via discourse relation modeling. TRACE identifies two canonical overthinking patterns: Explorer (exploratory redundancy) and Late Landing (delayed convergence). Building on this, we introduce a utility-based, structurally grounded definition of excessive thinking.
Contribution/Results: Experiments show that extended reasoning chains slow inference by 5–20× on simple tasks without improving accuracy. TRACE provides an interpretable, graph-structured foundation for diagnosing and mitigating overthinking, establishing a novel methodology for reasoning efficiency analysis in LLMs.
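The pipeline sketched in the Method section (split a CoT trace into atomic thought units, then link consecutive units via discourse relations into a thought evolution graph) can be illustrated roughly as follows. Note this is a toy sketch: the sentence-level segmentation, cue-word relation classifier, and relation labels below are illustrative assumptions, not TRACE's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical discourse relation labels between consecutive sub-thoughts.
RELATIONS = {"continuation", "verification", "exploration", "conclusion"}

@dataclass
class ThoughtGraph:
    units: list = field(default_factory=list)   # atomic sub-thoughts
    edges: list = field(default_factory=list)   # (src, dst, relation)

    def add_unit(self, text):
        self.units.append(text)
        return len(self.units) - 1

    def add_edge(self, src, dst, relation):
        assert relation in RELATIONS
        self.edges.append((src, dst, relation))

def segment(cot_text):
    """Naive segmentation into atomic thought units (assumption:
    one unit per sentence; the paper's decomposition is finer-grained)."""
    return [s.strip() for s in cot_text.split(".") if s.strip()]

def classify_relation(prev, curr):
    """Toy cue-word relation classifier, a stand-in for the paper's
    discourse relation modeling."""
    lowered = curr.lower()
    if lowered.startswith(("wait", "let me check", "double-check")):
        return "verification"
    if lowered.startswith(("alternatively", "another way")):
        return "exploration"
    if lowered.startswith(("so", "therefore", "thus")):
        return "conclusion"
    return "continuation"

def build_graph(cot_text):
    g = ThoughtGraph()
    prev = None
    for unit in segment(cot_text):
        idx = g.add_unit(unit)
        if prev is not None:
            g.add_edge(prev, idx, classify_relation(g.units[prev], unit))
        prev = idx
    return g

cot = ("2 + 2 is 4. Wait, let me check that again. "
       "Alternatively, another way is counting on fingers. So the answer is 4.")
g = build_graph(cot)
print([r for _, _, r in g.edges])  # ['verification', 'exploration', 'conclusion']
```

On a graph like this, runs of `verification` edges would surface Late Landing-style delayed convergence, while `exploration` branches would surface Explorer-style redundancy.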
📝 Abstract
Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet this capability introduces a critical and often overlooked inefficiency -- overthinking -- models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computational cost without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations and fail to probe LLMs' inner workings. To bridge this gap, this study introduces TRACE, a systematic, fine-grained analyzer of LLMs' thought processes. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial accuracy gains. We then use TRACE to decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking that moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
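To make the contrast with length-based metrics concrete, here is a minimal sketch of what a utility-based overthinking measure could look like, assuming each sub-thought is scored by its marginal contribution to the final answer. The threshold and scoring scheme are hypothetical, not the paper's exact definition.

```python
def overthinking_score(unit_utilities, eps=0.05):
    """Fraction of sub-thoughts whose marginal utility toward the final
    answer falls below eps. A length-based metric penalizes all long
    chains equally; here only low-utility units count as excess.
    (Illustrative formulation, not the paper's actual definition.)"""
    if not unit_utilities:
        return 0.0
    redundant = sum(1 for u in unit_utilities if u < eps)
    return redundant / len(unit_utilities)

# A short, useful chain vs. a long chain padded with re-verification:
concise = [0.6, 0.4]                    # every step contributes
padded  = [0.6, 0.02, 0.01, 0.0, 0.4]   # three near-useless steps
print(overthinking_score(concise))  # 0.0
print(overthinking_score(padded))   # 0.6
```

Under a definition like this, a long chain is not overthinking per se; only the portion that contributes no utility is, which matches the paper's move away from raw length.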