AI Summary
This work investigates the relationship between chain-of-thought (CoT) reasoning length and inference accuracy in large language models, revealing a non-monotonic pattern: accuracy first improves and then degrades as the step count increases, so there is an optimal CoT length. The authors provide the first theoretical proof that such an optimum exists and derive scaling laws characterizing how it varies with model capability and task difficulty. To automatically identify effective reasoning lengths, they propose Length-filtered Vote, a voting-based method that selects the most consistent answer across candidate CoT lengths. Extensive experiments on synthetic benchmarks and real-world tasks (e.g., GSM8K, MMLU) confirm that the phenomenon is widespread and demonstrate substantial gains over standard CoT and fixed-length long-CoT baselines. The core contribution is uncovering the "noise-benefit" trade-off inherent in CoT length selection, establishing a dynamic calibration paradigm that provides both theoretical foundations and practical tools for controllable, length-aware reasoning.
Abstract
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have explored ways to guide models to generate more elaborate CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real-world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
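The selection rule behind Length-filtered Vote, as described above, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled chain of thought is reduced to an (answer, step-count) pair, buckets samples into a hypothetical number of length bins, scores each bin by how self-consistent its answers are, and returns the majority answer of the most consistent bin.

```python
from collections import Counter, defaultdict

def length_filtered_vote(samples, num_bins=3):
    """Sketch of length-filtered voting (illustrative, not the paper's code).

    samples: list of (answer, num_steps) pairs from independent CoT samples.
    Buckets samples by reasoning length, then picks the majority answer
    of the bucket whose answers agree with each other the most.
    """
    if not samples:
        raise ValueError("need at least one sampled chain of thought")

    lengths = [steps for _, steps in samples]
    lo, hi = min(lengths), max(lengths)
    span = max(1, hi - lo + 1)

    # Assign each sample to one of `num_bins` equal-width length buckets.
    bins = defaultdict(list)
    for answer, steps in samples:
        b = min(num_bins - 1, (steps - lo) * num_bins // span)
        bins[b].append(answer)

    # Score each bucket by the fraction of samples agreeing on its
    # majority answer; break ties by absolute vote count.
    best_key, best_answer = None, None
    for answers in bins.values():
        answer, votes = Counter(answers).most_common(1)[0]
        key = (votes / len(answers), votes)
        if best_key is None or key > best_key:
            best_key, best_answer = key, answer
    return best_answer

# Short chains agree on "4"; long chains are noisy and disagree,
# so the short-length bucket wins the vote.
result = length_filtered_vote(
    [("4", 2), ("4", 3), ("4", 3), ("5", 9), ("7", 10)]
)
```

The binning scheme and tie-breaking rule here are assumptions; the key idea from the summary is only that answer consistency within a candidate length range, rather than raw length, drives the final selection.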