🤖 AI Summary
This study investigates how “reasoning budget”—i.e., the number of reasoning steps—affects large language models’ (LLMs) inference performance, challenging the implicit assumption that more budget invariably improves performance. Method: We systematically evaluate diverse configurations of chain-of-thought (CoT), self-consistency (SC), and self-reflection (SR) on standardized reasoning benchmarks. Contribution/Results: We find diminishing marginal returns from extending CoT length, accompanied by substantial computational overhead. Crucially, we propose a “strategy synergy over budget stacking” paradigm: integrating SC and SR under low reasoning budgets consistently outperforms high-budget baselines across multiple tasks—achieving Pareto improvements in both accuracy and computational efficiency. This yields a reproducible, low-cost pathway for efficient LLM inference.
📝 Abstract
Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute, measured as the length of the reasoning process (the so-called thinking budget), impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various inference configurations such as self-consistency and self-reflection. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we show that simply increasing the thinking budget is not the most effective use of compute: more accurate responses can instead be achieved through alternative configurations such as self-consistency and self-reflection.
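The self-consistency strategy the abstract refers to amounts to sampling several independent reasoning chains and majority-voting over their final answers. A minimal sketch of that voting loop, where the hypothetical `sample_answer` stub stands in for an actual LLM call returning one chain's final answer:

```python
from collections import Counter
import random


def sample_answer(question: str, rng: random.Random) -> str:
    # Stub for an LLM call: in practice this would sample one
    # chain-of-thought and extract its final answer. Here a biased
    # coin stands in for a model that is right ~70% of the time.
    return "42" if rng.random() < 0.7 else "41"


def self_consistency(question: str, n_samples: int = 11, seed: int = 0) -> str:
    # Sample n_samples independent reasoning chains and return the
    # most frequent final answer (majority vote).
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


print(self_consistency("What is 6 * 7?"))  # → 42
```

Note that `n_samples` is one axis of the compute budget the paper studies: each extra vote costs a full reasoning chain, which is why the accuracy-versus-cost trade-off matters.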