Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

📅 2025-06-30
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from inefficient inference on simple tasks due to over-generation of thinking tokens (e.g., "wait", "however") that trigger redundant reflection and backtracking, a phenomenon the paper defines as the "thinking trap". To address this, the authors propose Dual Policy Preference Optimization (DuP-PO), a preference-based optimization framework that regulates thinking-token generation via rollout sampling, fine-grained advantage control, and policy shaping, thereby balancing efficient intuitive reasoning with necessary deep reasoning. Evaluated on five mathematical reasoning benchmarks, DuP-PO achieves significant gains in token efficiency (+23.6% on average) while outperforming state-of-the-art baselines. The work both uncovers an implicit inefficiency mechanism in LRMs on simple tasks and introduces the first preference optimization framework explicitly designed for controllable thinking-token generation.

Technology Category
Large Language Model

Application Category
Natural Language Processing
📝 Abstract
Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., "wait", "however"). These tokens trigger unnecessary high-level reasoning behaviors such as reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) a rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) a fine-grained advantage control technique that dynamically regulates the prediction of target tokens; (3) a policy shaping method that ensures stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on popular LRMs, significantly improving their token efficiency during reasoning while surpassing the performance of the base model.
Problem

Research questions and friction points this paper is trying to address.

LRMs overuse thinking tokens on simple tasks
Thinking tokens reduce efficiency and can hinder correct reasoning under constrained token budgets
How to regulate thinking-token generation without sacrificing accuracy, which motivates DuP-PO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Policy Preference Optimization (DuP-PO) algorithm
Rollout sampling for balanced exposure to responses with and without thinking tokens
Fine-grained advantage control to dynamically regulate target-token prediction
Policy shaping for stable gradients from thinking tokens, sketched below
👥 Authors
Bowen Ding (Zhejiang University; School of Engineering, Westlake University)
Yuhan Chen (Boston University)
Futing Wang (Zhejiang University; School of Engineering, Westlake University)
Lingfeng Ming (Alibaba Group)
Tao Lin (School of Engineering, Westlake University; Research Center for Industries of the Future, Westlake University)