When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the prevailing assumption that longer reasoning chains invariably improve large language model performance. It reports that excessive reasoning, termed “overthinking”, is associated with models abandoning initially correct answers, which degrades accuracy. To study this, the authors introduce a cost-aware evaluation framework built on test-time compute scaling and marginal gain analysis, and they consider stopping reasoning early rather than spending a fixed budget on every problem. Empirical results show that accuracy comparable to that at the largest budgets can be reached with moderate reasoning budgets, substantially reducing computational overhead. Because the optimal thinking length varies with problem difficulty, a uniform, fixed reasoning length is suboptimal, and adapting compute to each problem is the more efficient alternative.
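The marginal gain analysis and early stopping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-1k-token normalization, the threshold, and the accuracy figures are all hypothetical and chosen only to show the idea.

```python
from typing import List, Sequence


def marginal_gains(budgets: Sequence[int], accuracies: Sequence[float]) -> List[float]:
    """Accuracy gained per additional 1,000 reasoning tokens between consecutive budgets."""
    if len(budgets) != len(accuracies):
        raise ValueError("budgets and accuracies must have the same length")
    gains = []
    for i in range(1, len(budgets)):
        extra_tokens = budgets[i] - budgets[i - 1]
        gains.append((accuracies[i] - accuracies[i - 1]) / extra_tokens * 1000)
    return gains


def stopping_budget(
    budgets: Sequence[int], accuracies: Sequence[float], min_gain_per_1k: float
) -> int:
    """Smallest budget whose next step yields less than min_gain_per_1k accuracy."""
    for i, gain in enumerate(marginal_gains(budgets, accuracies)):
        if gain < min_gain_per_1k:
            return budgets[i]
    # Every step was still worth its cost: use the largest budget.
    return budgets[-1]


# Illustrative numbers only (NOT results from the paper): accuracy rises,
# flattens, then dips slightly at the largest budget.
example_budgets = [1000, 2000, 4000, 8000]
example_accuracies = [0.50, 0.60, 0.64, 0.63]

example_gains = marginal_gains(example_budgets, example_accuracies)
example_stop = stopping_budget(example_budgets, example_accuracies, min_gain_per_1k=0.03)
```

With these made-up numbers the rule stops at the 2,000-token budget, because the step to 4,000 tokens adds fewer than 0.03 accuracy points per 1,000 tokens. A negative entry in `example_gains` corresponds to a budget increase that lowered accuracy.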

📝 Abstract

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit “overthinking”, where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
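One way to quantify the overthinking effect described in the abstract is to count how often an answer that was correct at a lower budget becomes incorrect at a higher one. The sketch below assumes per-problem correctness flags at two budgets are available; the function name and the example data are hypothetical, and the abstract does not specify the paper's own metric.

```python
from typing import Sequence


def overthinking_rate(correct_low: Sequence[bool], correct_high: Sequence[bool]) -> float:
    """Fraction of problems solved at the lower budget that are wrong at the higher budget."""
    if len(correct_low) != len(correct_high):
        raise ValueError("both budgets must cover the same problems")
    # Keep only the problems that were already correct at the lower budget.
    high_given_low_correct = [high for low, high in zip(correct_low, correct_high) if low]
    if not high_given_low_correct:
        return 0.0
    abandoned = sum(1 for high in high_given_low_correct if not high)
    return abandoned / len(high_given_low_correct)


# Illustrative flags only (NOT data from the paper): four problems,
# evaluated at a lower and a higher reasoning budget.
example_low = [True, True, False, True]
example_high = [True, False, True, True]

example_rate = overthinking_rate(example_low, example_high)
```

Here one of the three problems that were correct at the lower budget is lost at the higher budget, so `example_rate` is one third. Net accuracy is unchanged in this toy case because another problem is newly solved, which is why a flip-based measure can reveal behavior that aggregate accuracy hides.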

Problem

Research questions and friction points this paper is trying to address.

overthinking

test-time compute scaling

chain-of-thought reasoning

marginal utility

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute scaling

overthinking

chain-of-thought reasoning