AI Summary
This work investigates the relationship between chain-of-thought (CoT) reasoning length and inference accuracy in large language models, revealing a non-monotonic pattern: accuracy first improves and then degrades as the step count increases, so there is an optimal CoT length. The authors provide the first theoretical proof that such an optimum exists and derive scaling laws characterizing how it varies with model capability and task difficulty. To automatically identify effective reasoning lengths, they propose Length-filtered Vote, a voting-based method that selects the most consistent answer across candidate CoT lengths. Extensive experiments on synthetic benchmarks and real-world tasks (e.g., GSM8K, MMLU) confirm that the phenomenon is widespread and demonstrate substantial gains over standard CoT and fixed-length long-CoT baselines. The core contribution is uncovering the "noise-benefit" trade-off inherent in CoT length selection, establishing a dynamic calibration paradigm that provides both theoretical foundations and practical tools for controllable, length-aware reasoning.
Abstract
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have explored ways to guide models to generate more elaborate CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real-world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
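The selection rule behind Length-filtered Vote, as described above, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled chain of thought is reduced to an (answer, step-count) pair, buckets samples into a hypothetical number of length bins, scores each bin by how self-consistent its answers are, and returns the majority answer of the most consistent bin.

```python
from collections import Counter, defaultdict

def length_filtered_vote(samples, num_bins=3):
    """Sketch of length-filtered voting (illustrative, not the paper's code).

    samples: list of (answer, num_steps) pairs from independent CoT samples.
    Buckets samples by reasoning length, then picks the majority answer
    of the bucket whose answers agree with each other the most.
    """
    if not samples:
        raise ValueError("need at least one sampled chain of thought")

    lengths = [steps for _, steps in samples]
    lo, hi = min(lengths), max(lengths)
    span = max(1, hi - lo + 1)

    # Assign each sample to one of `num_bins` equal-width length buckets.
    bins = defaultdict(list)
    for answer, steps in samples:
        b = min(num_bins - 1, (steps - lo) * num_bins // span)
        bins[b].append(answer)

    # Score each bucket by the fraction of samples agreeing on its
    # majority answer; break ties by absolute vote count.
    best_key, best_answer = None, None
    for answers in bins.values():
        answer, votes = Counter(answers).most_common(1)[0]
        key = (votes / len(answers), votes)
        if best_key is None or key > best_key:
            best_key, best_answer = key, answer
    return best_answer

# Short chains agree on "4"; long chains are noisy and disagree,
# so the short-length bucket wins the vote.
result = length_filtered_vote(
    [("4", 2), ("4", 3), ("4", 3), ("5", 9), ("7", 10)]
)
```

The binning scheme and tie-breaking rule here are assumptions; the key idea from the summary is only that answer consistency within a candidate length range, rather than raw length, drives the final selection.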