Does Your Reasoning Model Implicitly Know When to Stop Thinking?

📅 2026-02-09
🤖 AI Summary
Large reasoning models often suffer from inefficiency, and even degraded accuracy on complex tasks, because they generate excessively long and unproductive chains of thought. This work reveals, for the first time, that such models inherently possess a latent capability to “stop reasoning at the right time,” and introduces the Self-Aware Guided Efficient Reasoning (SAGE) paradigm, which explicitly activates this ability through a hybrid sampling strategy. Building on SAGE, the authors further develop the SAGE-RL framework by integrating group-based reinforcement learning. Experimental results demonstrate that the proposed approach significantly improves both reasoning accuracy and efficiency across multiple mathematical reasoning benchmarks, while successfully embedding this efficient reasoning mechanism into the standard pass@1 evaluation pipeline.

πŸ“ Abstract
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
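The abstract describes SAGE-RL as feeding SAGE's mixed sampling into group-based reinforcement learning so that efficient reasoning patterns transfer to standard pass@1 inference. The paper's actual algorithm is not reproduced on this page; below is a minimal sketch under stated assumptions: a GRPO-style group-normalized advantage, a hypothetical correctness-minus-length reward (`shaped_reward`), and illustrative rollout records mixing ordinary samples with early-stopped ones.

```python
import statistics

def shaped_reward(correct, n_tokens, length_penalty=0.001):
    # Hypothetical reward shaping (not the paper's objective):
    # 1.0 for a correct answer, minus a small per-token cost.
    return (1.0 if correct else 0.0) - length_penalty * n_tokens

def group_advantages(rewards):
    # GRPO-style group normalization: A_i = (r_i - mean) / std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std if std > 0 else 1.0) for r in rewards]

# A mixed rollout group: "standard" samples alongside "early_stop"
# (SAGE-style) samples truncated at the model's implicit stopping point.
# Fields: (kind, answered correctly, token length) -- all illustrative.
rollouts = [
    ("standard",   True,  800),
    ("standard",   False, 1200),
    ("early_stop", True,  300),
    ("early_stop", False, 250),
]
rewards = [shaped_reward(c, n) for _, c, n in rollouts]
advs = group_advantages(rewards)

# Under this shaping, the short correct rollout earns the largest
# advantage, so policy updates favor the efficient reasoning trace.
best = max(range(len(advs)), key=advs.__getitem__)
print(rollouts[best][0])  # early_stop
```

The design point the sketch illustrates: because advantages are normalized within each mixed group, short correct traces outrank long correct ones, which is one plausible mechanism for distilling SAGE-discovered stopping behavior into pass@1 decoding.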
Problem

Research questions and friction points this paper is trying to address.

large reasoning models
reasoning efficiency
chain-of-thought redundancy
stopping criterion
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Aware Reasoning
Efficient Sampling
Long Chain of Thought
Reinforcement Learning
Large Reasoning Models
Authors
Zixuan Huang (Beihang University)
Xin Xia (ByteDance Seed): Deep learning
Yuxi Ren (ByteDance China)
Jianbin Zheng (ByteDance China)
Xuanda Wang (ByteDance China)
Zhixia Zhang (Beihang University)
Hongyan Xie (Beihang University)
Songshi Liang (Renmin University of China)
Zehao Chen (PhD, Yale University): Porous Media, Fluid Dynamics, Polymer, Hydrogel
Xuefeng Xiao (ByteDance Seed): Computer Vision, Efficient AI
Fuzhen Zhuang (Beihang University)
Jianxin Li (School of Computer Science & Engineering, Beihang University): Big Data, AI, Intelligent Computing
Yikun Ban (Beihang University, University of Illinois Urbana-Champaign): Reinforcement Learning, Ensemble Learning
Deqing Wang (Beihang University)