Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) commonly suffer from overthinking in chain-of-thought (CoT) reasoning—generating redundant, uninformative steps that increase computational cost and degrade accuracy. Method: We propose a training-free, test-time adaptive early-stopping framework that treats “when to stop” as an extendable reasoning dimension. Our approach introduces, for the first time, an unsupervised two-stage stopping discriminator coupled with a sliding-window multi-armed bandit controller. It integrates reflective redundancy detection with Upper Confidence Bound (UCB)-based dynamic threshold adjustment. Contribution/Results: Evaluated across four mainstream benchmarks and two major LLM families, our method reduces token consumption by 20%–55% on average while preserving or improving accuracy. It demonstrates strong robustness and cross-task generalization without task-specific fine-tuning or additional model parameters.

📝 Abstract
Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning -- so-called overthinking -- can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (REFlective-Redundancy for Adaptive INference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling -- enabling models to reason not just more, but just enough.
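The SW-UCB controller named in the abstract can be sketched as follows. This is a generic sliding-window UCB bandit whose arms are candidate stopping thresholds; the class name, window size, exploration constant, and reward convention are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB bandit over candidate stopping thresholds.

    Sketch only: names and defaults are assumptions, not the paper's code.
    """

    def __init__(self, arms, window=50, c=1.0):
        self.arms = list(arms)                # e.g. candidate thresholds
        self.window = window                  # only recent pulls count
        self.c = c                            # exploration strength
        self.history = deque(maxlen=window)   # recent (arm_index, reward) pairs
        self.t = 0                            # total number of selections

    def select(self):
        """Return the index of the arm to play next."""
        self.t += 1
        counts = [0] * len(self.arms)
        sums = [0.0] * len(self.arms)
        for i, r in self.history:
            counts[i] += 1
            sums[i] += r
        best, best_score = 0, float("-inf")
        for i in range(len(self.arms)):
            if counts[i] == 0:
                return i                      # re-explore arms forgotten by the window
            score = sums[i] / counts[i] + self.c * math.sqrt(
                math.log(min(self.t, self.window)) / counts[i]
            )
            if score > best_score:
                best, best_score = i, score
        return best

    def update(self, arm_index, reward):
        """Record the observed reward; old observations fall out of the window."""
        self.history.append((arm_index, reward))
```

A caller would pick the current threshold via `bandit.arms[bandit.select()]` after each problem and feed back a reward that trades off answer accuracy against tokens saved; the sliding window lets the preferred threshold drift as problem difficulty changes.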
Problem

Research questions and friction points this paper is trying to address.

Adaptively stopping reasoning to mitigate overthinking
Reducing token usage while maintaining accuracy
Determining optimal stopping thresholds without supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework adaptively stops reasoning
Two-stage discriminator identifies redundant reasoning steps
Sliding-window bandit controller adjusts stopping thresholds dynamically
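The two-stage discriminator in the list above can be illustrated with a minimal sketch: stage one flags steps that merely re-verify earlier work, stage two scores their overlap with prior steps, and reasoning halts only when both signals exceed the bandit-chosen threshold. The marker list, similarity measure, and function names are hypothetical stand-ins, not the paper's actual scorers.

```python
import difflib

# Hypothetical cue phrases for reflective (re-checking) steps.
REFLECTIVE_MARKERS = ("wait", "let me reconsider", "alternatively", "double-check")


def is_reflective(step: str) -> bool:
    """Stage 1 (illustrative): flag steps that reread or re-verify earlier work."""
    lowered = step.lower()
    return any(marker in lowered for marker in REFLECTIVE_MARKERS)


def redundancy(step: str, history: list[str]) -> float:
    """Stage 2 (illustrative): max surface similarity to any earlier step."""
    return max(
        (difflib.SequenceMatcher(None, step.lower(), h.lower()).ratio() for h in history),
        default=0.0,
    )


def should_stop(step: str, history: list[str], threshold: float) -> bool:
    """Stop only when a step is both reflective and redundant past the
    dynamically chosen threshold."""
    return is_reflective(step) and redundancy(step, history) >= threshold
```

In a full loop, `threshold` would come from the bandit controller, so easy problems (where reflection quickly becomes repetitive) stop early while hard problems are allowed to keep reasoning.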
Renliang Sun
University of California, Los Angeles
Natural Language Processing · Large Language Models
Wei Cheng
NEC Labs America
Dawei Li
Arizona State University
Haifeng Chen
NEC Labs America
Wei Wang
UCLA