🤖 AI Summary
Large language models (LLMs) commonly suffer from overthinking in chain-of-thought (CoT) reasoning—generating redundant, uninformative steps that increase computational cost and degrade accuracy.
Method: We propose a training-free, test-time adaptive early-stopping framework that treats “when to stop” as an extendable reasoning dimension. Our approach is, to the best of our knowledge, the first to couple an unsupervised two-stage stopping discriminator with a sliding-window multi-armed bandit controller, integrating reflective-redundancy detection with Upper Confidence Bound (UCB)-based dynamic threshold adjustment.
Contribution/Results: Evaluated across four mainstream benchmarks and two major LLM families, our method reduces token consumption by 20%–55% on average while preserving or improving accuracy. It demonstrates strong robustness and cross-task generalization without task-specific fine-tuning or additional model parameters.
📝 Abstract
Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning, so-called overthinking, can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (**REF**lective-**R**edundancy for **A**daptive **IN**ference), a training-free framework that adaptively determines when to stop reasoning in order to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator that identifies reflective yet redundant reasoning with a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller that dynamically adjusts stopping thresholds according to problem difficulty, without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20%–55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling, enabling models to reason not just more, but just enough.
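To make the controller concrete, here is a minimal sketch of a sliding-window UCB bandit of the kind the abstract describes. This is not the paper's implementation: the arm set, window size, exploration constant, and reward design (here assumed to combine correctness with token savings) are illustrative assumptions.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB over a discrete set of candidate stopping thresholds.

    Hypothetical sketch: each arm is one threshold; only the most recent
    `window` (arm, reward) observations inform the UCB scores, so the
    controller can track non-stationary problem difficulty.
    """

    def __init__(self, n_arms: int, window: int = 50, c: float = 1.0):
        self.n_arms = n_arms
        self.c = c  # exploration constant
        self.history = deque(maxlen=window)  # recent (arm, reward) pairs

    def select_arm(self) -> int:
        # Aggregate counts and reward sums within the current window only.
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        total = max(1, len(self.history))
        best, best_score = 0, float("-inf")
        for a in range(self.n_arms):
            if counts[a] == 0:
                return a  # play every arm at least once within the window
            mean = sums[a] / counts[a]
            bonus = self.c * math.sqrt(math.log(total) / counts[a])
            if mean + bonus > best_score:
                best, best_score = a, mean + bonus
        return best

    def update(self, arm: int, reward: float) -> None:
        # Reward is assumed to trade off answer quality against tokens spent,
        # e.g. reward = correctness - lambda * tokens_used (not specified here).
        self.history.append((arm, reward))
```

In use, the reasoning loop would call `select_arm()` to pick a stopping threshold for the current problem, run the stop discriminator with that threshold, and then feed the observed reward back via `update()`; the `maxlen` deque makes old observations age out automatically.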