Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from "overthinking"—inefficiently generating explicit chain-of-thought (CoT) reasoning even for simple problems, degrading both accuracy and inference efficiency. Method: We propose AutoThink, a multi-stage reinforcement learning framework that endows LRMs with adaptive reasoning: dynamically deciding whether to produce an explicit CoT based on problem complexity. Its core innovations include (i) leveraging ellipsis-based prompting to elicit implicit yet controllable reasoning, and (ii) integrating reward shaping with dynamic gating over reasoning paths. Contribution/Results: Evaluated on the R1-distilled model (DeepSeek-R1-Distill-Qwen-1.5B), AutoThink achieves a 6.4% relative accuracy gain and a 52% token reduction across five major mathematical benchmarks. It is the first work to enable end-to-end, demand-driven reasoning policy learning in LRMs, establishing a new paradigm for efficient, context-aware reasoning.

📝 Abstract
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
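The latent controllability described above can be illustrated with a minimal prompt-construction sketch. The tag names and prompt layout here follow the commonly used R1-style chat format, but the exact template is an assumption, not taken from the paper:

```python
# Hypothetical sketch of the ellipsis trigger described in the abstract:
# placing "..." immediately after the opening <think> tag lets an R1-style
# model stochastically continue in either thinking or no-thinking mode.
# The tag/token names below are illustrative, not the paper's exact format.

def build_prompt(question: str, ellipsis: bool = True) -> str:
    """Assemble an R1-style prompt; the ellipsis after <think> is the
    minimal intervention that AutoThink builds on."""
    prompt = f"<|User|>{question}<|Assistant|><think>"
    if ellipsis:
        prompt += "..."  # stochastically elicits thinking or no-thinking
    return prompt

print(build_prompt("What is 2 + 3?"))
```

The key point is that the intervention is purely textual: no weights are changed, yet the model's reasoning behavior becomes steerable, which AutoThink then shapes with RL.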
Problem

Research questions and friction points this paper is trying to address.

Enable LRMs to dynamically decide reasoning based on complexity
Reduce computational overhead from unnecessary step-by-step reasoning
Optimize accuracy-efficiency trade-offs via adaptive thinking policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage RL optimizes adaptive reasoning policies
Ellipsis in prompts triggers controllable reasoning modes
AutoThink balances accuracy and efficiency dynamically
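The stage-wise reward shaping named in the bullets above can be sketched as a scalar reward that trades correctness against token cost. The coefficients, stage schedule, and penalty terms below are illustrative placeholders, not the paper's actual reward design:

```python
# Hedged sketch of stage-wise reward shaping in the spirit of AutoThink.
# All numeric coefficients here are hypothetical; the paper's actual
# reward terms and stage schedule may differ.

def shaped_reward(correct: bool, used_thinking: bool, n_tokens: int,
                  stage: int, max_tokens: int = 4096) -> float:
    """Return a scalar reward trading accuracy against token cost.
    Later stages (higher `stage`) penalize unnecessary explicit
    reasoning more strongly, nudging the policy toward succinct
    no-thinking responses on easy problems."""
    reward = 1.0 if correct else 0.0
    length_penalty = 0.1 * stage * (n_tokens / max_tokens)  # hypothetical scale
    if correct and used_thinking:
        reward -= length_penalty  # discourage overthinking on solved problems
    if not correct and not used_thinking:
        reward -= 0.5  # hypothetical penalty: reasoning was needed but skipped
    return reward

print(shaped_reward(correct=True, used_thinking=True, n_tokens=2048, stage=2))
```

Increasing the length penalty across training stages is one plausible way to realize "progressively optimizes reasoning policies via stage-wise reward shaping": early stages prioritize correctness, later stages prune reasoning the policy no longer needs.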
Songjun Tu
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory
Large Language Models; Reinforcement Learning
Jiahao Lin
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qichao Zhang
Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence; Reinforcement Learning; Game Theory; Adaptive Dynamic Programming
Xiangyu Tian
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Linjing Li
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiangyuan Lan
Pengcheng Laboratory
Multimodal LLM; Place Recognition; Visual Tracking; Person Re-identification; Object Detection
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement Learning; Adaptive Dynamic Programming; Game AI; Smart Driving; Robotics