Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

📅 2025-06-30
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from inefficient inference on simple tasks due to over-generation of thinking tokens (e.g., "wait", "however") that trigger redundant reflection and backtracking, a phenomenon the paper defines as the "thinking trap". To address this, the authors propose Dual Policy Preference Optimization (DuP-PO), a preference-based optimization framework that regulates thinking-token generation via rollout sampling, fine-grained advantage control, and policy shaping, thereby balancing efficient intuitive reasoning with necessary deep reasoning. Evaluated on five mathematical reasoning benchmarks, DuP-PO achieves significant gains in token efficiency (+23.6% on average) while outperforming state-of-the-art baselines. The work both uncovers an implicit inefficiency mechanism in LRMs on simple tasks and introduces the first preference optimization framework explicitly designed for controllable thinking-token generation.

Technology Category
Large Language Model

Application Category
Natural Language Processing
📝 Abstract
Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., "wait", "however"). These tokens trigger unnecessary high-level reasoning behaviors such as reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) a rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) a fine-grained advantage control technique that dynamically regulates the prediction of target tokens; (3) a policy shaping method that ensures stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on popular LRMs, significantly improving their token efficiency during reasoning while surpassing the performance of the base model.
Problem

Research questions and friction points this paper is trying to address.

LRMs overuse thinking tokens on simple tasks
Thinking tokens reduce efficiency and can hinder correct reasoning under constrained token budgets
How to regulate thinking-token generation without sacrificing accuracy, which motivates DuP-PO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Policy Preference Optimization (DuP-PO) algorithm
Rollout sampling for balanced exposure to responses with and without thinking tokens
Fine-grained advantage control to dynamically regulate target-token prediction
Policy shaping for stable gradients from thinking tokens, sketched below
👥 Authors
Bowen Ding (Zhejiang University; School of Engineering, Westlake University)
Yuhan Chen (Boston University)
Futing Wang (Zhejiang University; School of Engineering, Westlake University)
Lingfeng Ming (Alibaba Group)
Tao Lin (School of Engineering, Westlake University; Research Center for Industries of the Future, Westlake University)