Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs

📅 2025-04-30

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

This paper identifies a “under-reasoning–over-reasoning” bimodal misalignment in large language models (LLMs) regarding reasoning-length control: LLMs tend to over-reason on simple problems while under-reasoning on complex ones, revealing deficits in difficulty awareness and adaptive generation-length control. To address this, we propose a difficulty-aware preference optimization framework that treats generation length as a proxy signal for reasoning difficulty—e.g., length-constrained variants of DPO—integrated with empirically grounded statistical modeling of reasoning length distributions and difficulty estimation. Evaluated across multiple reasoning benchmarks, our method achieves 30–50% average reduction in generated token count with less than 2% accuracy degradation. This work provides the first empirical validation that reasoning length is a reliable, optimizable behavioral signal of reasoning quality, establishing a new paradigm for controllable reasoning modeling.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.

Problem

Research questions and friction points this paper is trying to address.

Study relationship between reasoning length and answer correctness in LLMs

LLMs overthink simple problems and underthink hard ones

Investigate effects of length reduction on maintaining accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical study on reasoning length and correctness

Preference optimization for shorter response lengths

Generation length as signal for reasoning behavior

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting