🤖 AI Summary
This work addresses a limitation of conventional chain-of-thought (CoT) reasoning under fixed output-length constraints: because reasoning steps and the final answer share a single token budget, long reasoning chains can crowd out the space needed for the answer and so reduce accuracy. The study identifies and quantifies this “coupling tax” and proposes allocating separate token budgets to reasoning and answer generation. To predict the budget at which thinking mode overtakes non-thinking mode, the authors derive a truncation-waste decomposition model. Experiments on GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales show that non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, and a DeepSeek-R1-Distill-Llama-8B replication shows the same pattern. On full MATH-500, split-budget generation (IRIS) reaches 74.0% accuracy, and a fixed non-oracle SC+IRIS gate reaches 83.6%.
📝 Abstract
Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
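The decomposition in the abstract can be evaluated numerically. The sketch below is a minimal illustration, not the authors' code, and it rests on an interpretation the abstract does not spell out: $F_L(b)$ is taken to be the empirical fraction of reasoning chains that fit within budget $b$, $\alpha_c$ the accuracy on chains that complete, and $\alpha_t$ the accuracy on truncated chains. The function names, the constant non-thinking accuracy `acc_nothink`, and all numbers are hypothetical and chosen only to show how a crossover budget could be located.

```python
from bisect import bisect_right


def empirical_cdf(chain_lengths, b):
    """Fraction of chains whose length is <= b tokens (assumed reading of F_L(b))."""
    ordered = sorted(chain_lengths)
    return bisect_right(ordered, b) / len(ordered)


def predicted_think_accuracy(b, chain_lengths, alpha_c, alpha_t):
    """Acc_think(b) = alpha_c * F_L(b) + alpha_t * (1 - F_L(b))."""
    f = empirical_cdf(chain_lengths, b)
    return alpha_c * f + alpha_t * (1.0 - f)


def crossover_budget(budgets, chain_lengths, alpha_c, alpha_t, acc_nothink):
    """Smallest budget at which predicted thinking accuracy reaches acc_nothink.

    Treating non-thinking accuracy as constant across budgets is a
    simplifying assumption of this sketch. Returns None if no budget in
    the list reaches the threshold.
    """
    for b in sorted(budgets):
        if predicted_think_accuracy(b, chain_lengths, alpha_c, alpha_t) >= acc_nothink:
            return b
    return None


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    lengths = [300, 600, 900, 1500, 2500, 4000, 6000, 8000]
    budgets = [256, 512, 1024, 2048, 4096, 8192]
    for b in budgets:
        acc = predicted_think_accuracy(b, lengths, alpha_c=0.9, alpha_t=0.1)
        print(f"budget={b:5d}  predicted thinking accuracy={acc:.3f}")
    print("crossover:", crossover_budget(budgets, lengths, 0.9, 0.1, acc_nothink=0.6))
```

With these made-up inputs, half of the chains are still truncated at 2048 tokens, so the predicted thinking accuracy stays below the non-thinking baseline until the 4096-token budget. Longer chains on harder tasks shift $F_L$ to the right and push the crossover to larger budgets, which matches the qualitative pattern the abstract describes.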