🤖 AI Summary
Large language models (LLMs) commonly suffer from overthinking in chain-of-thought (CoT) reasoning—generating redundant, uninformative steps that increase computational cost and degrade accuracy.
Method: We propose a training-free, test-time adaptive early-stopping framework that treats “when to stop” as an extendable reasoning dimension. Our approach is, to the best of our knowledge, the first to couple an unsupervised two-stage stopping discriminator with a sliding-window multi-armed bandit controller, integrating reflective-redundancy detection with Upper Confidence Bound (UCB)-based dynamic threshold adjustment.
Contribution/Results: Evaluated across four mainstream benchmarks and two major LLM families, our method reduces token consumption by 20%–55% on average while preserving or improving accuracy. It demonstrates strong robustness and cross-task generalization without task-specific fine-tuning or additional model parameters.
📝 Abstract
Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning, so-called overthinking, can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (**REF**lective-**R**edundancy for **A**daptive **IN**ference), a training-free framework that adaptively determines when to stop reasoning in order to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator that identifies reflective yet redundant reasoning with a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller that dynamically adjusts stopping thresholds according to problem difficulty, without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20%–55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling, enabling models to reason not just more, but just enough.
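To make the controller concrete, here is a minimal sketch of a sliding-window UCB bandit of the kind the abstract describes. This is not the paper's implementation: the arm set, window size, exploration constant, and reward design (here assumed to combine correctness with token savings) are illustrative assumptions.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB over a discrete set of candidate stopping thresholds.

    Hypothetical sketch: each arm is one threshold; only the most recent
    `window` (arm, reward) observations inform the UCB scores, so the
    controller can track non-stationary problem difficulty.
    """

    def __init__(self, n_arms: int, window: int = 50, c: float = 1.0):
        self.n_arms = n_arms
        self.c = c  # exploration constant
        self.history = deque(maxlen=window)  # recent (arm, reward) pairs

    def select_arm(self) -> int:
        # Aggregate counts and reward sums within the current window only.
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        total = max(1, len(self.history))
        best, best_score = 0, float("-inf")
        for a in range(self.n_arms):
            if counts[a] == 0:
                return a  # play every arm at least once within the window
            mean = sums[a] / counts[a]
            bonus = self.c * math.sqrt(math.log(total) / counts[a])
            if mean + bonus > best_score:
                best, best_score = a, mean + bonus
        return best

    def update(self, arm: int, reward: float) -> None:
        # Reward is assumed to trade off answer quality against tokens spent,
        # e.g. reward = correctness - lambda * tokens_used (not specified here).
        self.history.append((arm, reward))
```

In use, the reasoning loop would call `select_arm()` to pick a stopping threshold for the current problem, run the stop discriminator with that threshold, and then feed the observed reward back via `update()`; the `maxlen` deque makes old observations age out automatically.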