Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from "overthinking"—inefficiently generating explicit chain-of-thought (CoT) reasoning even for simple problems, degrading both accuracy and inference efficiency. Method: We propose AutoThink, a multi-stage reinforcement learning framework that endows LRMs with adaptive reasoning: dynamically deciding whether to produce an explicit CoT based on problem complexity. Its core innovations include (i) leveraging ellipsis-based prompting to elicit implicit yet controllable reasoning, and (ii) integrating reward shaping with dynamic gating over reasoning paths. Contribution/Results: Evaluated on the R1-distilled model (DeepSeek-R1-Distill-Qwen-1.5B), AutoThink achieves a 6.4% relative accuracy gain and a 52% token reduction across five major mathematical benchmarks. It is the first work to enable end-to-end, demand-driven reasoning policy learning in LRMs, establishing a new paradigm for efficient, context-aware reasoning.

📝 Abstract
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
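The latent controllability described above can be illustrated with a minimal prompt-construction sketch. The tag names and prompt layout here follow the commonly used R1-style chat format, but the exact template is an assumption, not taken from the paper:

```python
# Hypothetical sketch of the ellipsis trigger described in the abstract:
# placing "..." immediately after the opening <think> tag lets an R1-style
# model stochastically continue in either thinking or no-thinking mode.
# The tag/token names below are illustrative, not the paper's exact format.

def build_prompt(question: str, ellipsis: bool = True) -> str:
    """Assemble an R1-style prompt; the ellipsis after <think> is the
    minimal intervention that AutoThink builds on."""
    prompt = f"<|User|>{question}<|Assistant|><think>"
    if ellipsis:
        prompt += "..."  # stochastically elicits thinking or no-thinking
    return prompt

print(build_prompt("What is 2 + 3?"))
```

The key point is that the intervention is purely textual: no weights are changed, yet the model's reasoning behavior becomes steerable, which AutoThink then shapes with RL.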
Problem

Research questions and friction points this paper is trying to address.

Enable LRMs to dynamically decide reasoning based on complexity
Reduce computational overhead from unnecessary step-by-step reasoning
Optimize accuracy-efficiency trade-offs via adaptive thinking policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage RL optimizes adaptive reasoning policies
Ellipsis in prompts triggers controllable reasoning modes
AutoThink balances accuracy and efficiency dynamically
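The stage-wise reward shaping named in the bullets above can be sketched as a scalar reward that trades correctness against token cost. The coefficients, stage schedule, and penalty terms below are illustrative placeholders, not the paper's actual reward design:

```python
# Hedged sketch of stage-wise reward shaping in the spirit of AutoThink.
# All numeric coefficients here are hypothetical; the paper's actual
# reward terms and stage schedule may differ.

def shaped_reward(correct: bool, used_thinking: bool, n_tokens: int,
                  stage: int, max_tokens: int = 4096) -> float:
    """Return a scalar reward trading accuracy against token cost.
    Later stages (higher `stage`) penalize unnecessary explicit
    reasoning more strongly, nudging the policy toward succinct
    no-thinking responses on easy problems."""
    reward = 1.0 if correct else 0.0
    length_penalty = 0.1 * stage * (n_tokens / max_tokens)  # hypothetical scale
    if correct and used_thinking:
        reward -= length_penalty  # discourage overthinking on solved problems
    if not correct and not used_thinking:
        reward -= 0.5  # hypothetical penalty: reasoning was needed but skipped
    return reward

print(shaped_reward(correct=True, used_thinking=True, n_tokens=2048, stage=2))
```

Increasing the length penalty across training stages is one plausible way to realize "progressively optimizes reasoning policies via stage-wise reward shaping": early stages prioritize correctness, later stages prune reasoning the policy no longer needs.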
Songjun Tu
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory
Large Language Models; Reinforcement Learning
Jiahao Lin
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qichao Zhang
Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence; Reinforcement Learning; Game Theory; Adaptive Dynamic Programming
Xiangyu Tian
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Linjing Li
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiangyuan Lan
Pengcheng Laboratory
Multimodal LLM; Place Recognition; Visual Tracking; Person Re-identification; Object Detection
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement Learning; Adaptive Dynamic Programming; Game AI; Smart Driving; Robotics