When More is Less: Understanding Chain-of-Thought Length in LLMs

📅 2025-02-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the relationship between chain-of-thought (CoT) reasoning length and inference accuracy in large language models, revealing a non-monotonic pattern: accuracy first improves and then degrades with increasing step count, exhibiting an optimal CoT length. The authors provide the first theoretical proof of the existence of such an optimum and derive scaling laws characterizing how it varies with model capability and task difficulty. To automatically identify effective reasoning lengths, they propose Length-filtered Vote, a voting-based method that selects the most consistent answer across candidate CoT lengths. Extensive experiments on synthetic benchmarks and real-world tasks (e.g., GSM8K, MMLU) confirm the universality of this phenomenon and demonstrate substantial gains over standard CoT and fixed-length long-CoT baselines. The core contribution lies in uncovering the "noise-benefit" trade-off inherent in CoT length selection, establishing a dynamic calibration paradigm that provides both theoretical foundations and practical tools for controllable, length-aware reasoning.

Technology Category

Application Category

๐Ÿ“ Abstract
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have been exploring ways to guide models to generate more complex CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real-world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
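The voting idea the abstract describes can be illustrated with a minimal sketch. The code below is a hypothetical reconstruction, not the authors' implementation: it assumes we have sampled several `(cot_length, answer)` pairs, groups them into length buckets, keeps the bucket whose answers agree most, and majority-votes within it. Bucket count and the consistency measure (fraction agreeing with the bucket mode) are illustrative choices.

```python
from collections import Counter

def length_filtered_vote(samples, num_buckets=3):
    """Sketch of a length-filtered vote: prefer the CoT-length bucket
    whose sampled answers are most self-consistent, then majority-vote.

    samples: list of (cot_length, answer) pairs from repeated sampling.
    """
    if not samples:
        raise ValueError("need at least one sample")
    lengths = [length for length, _ in samples]
    lo, hi = min(lengths), max(lengths)
    width = max(1, (hi - lo) // num_buckets + 1)

    # Partition answers by CoT-length bucket.
    buckets = {}
    for length, answer in samples:
        buckets.setdefault((length - lo) // width, []).append(answer)

    # Consistency = fraction of answers matching the bucket's mode.
    def consistency(answers):
        return Counter(answers).most_common(1)[0][1] / len(answers)

    best_bucket = max(buckets.values(), key=consistency)
    return Counter(best_bucket).most_common(1)[0][0]
```

Here short chains that agree beat a longer, noisier group: with samples like `[(3, "A"), (4, "A"), (5, "A"), (12, "B"), (13, "C"), (14, "B")]`, the short-length bucket is unanimous, so the vote returns `"A"`.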
Problem

Research questions and friction points this paper is trying to address.

Optimal CoT length determination
Noise susceptibility in long CoTs
Calibration of CoT with model-task fit
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Chain-of-Thought length
Length-filtered Vote method
Scaling law based analysis
🔎 Similar Papers
No similar papers found.