🤖 AI Summary
Large reasoning models (LRMs) suffer from excessive token consumption and high latency caused by unnecessarily long reasoning chains, yet the common assumption that “longer chains imply higher accuracy” lacks rigorous empirical or theoretical support.
Method: We propose a discounted reinforcement learning (DRL) framework that explicitly incorporates a small per-token cost into policy training, thereby optimizing for both accuracy and reasoning efficiency. Leveraging Blackwell optimality theory under restricted policy classes, we formally establish that concise reasoning is optimal whenever accuracy is preserved. Our approach requires no supervision, labels, or human annotations—chain compression is driven solely by token-level penalties.
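To make the per-token cost concrete, here is a minimal sketch of the kind of reward shaping such a framework could use before a standard policy-gradient update. This is an illustration, not the paper's implementation; the function name, the 0/1 accuracy reward, and the `token_cost` value are all hypothetical.

```python
def shaped_reward(is_correct: bool, num_reasoning_tokens: int,
                  token_cost: float = 1e-4) -> float:
    """Accuracy reward minus a small per-token penalty (illustrative only).

    Under a discount factor close to 1, each extra reasoning token slightly
    reduces the discounted return, so the trained policy is pushed toward
    the shortest chain that still answers correctly.
    """
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - token_cost * num_reasoning_tokens

# A longer correct chain earns strictly less than a shorter correct one,
# while incorrect answers are penalized regardless of length.
short_correct = shaped_reward(True, 200)
long_correct = shaped_reward(True, 800)
wrong = shaped_reward(False, 200)
print(short_correct, long_correct, wrong)
```

Because the penalty is computed from token counts alone, no labels or annotations of chain quality are needed, matching the supervision-free claim above.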
Contribution/Results: Experiments across multiple reasoning benchmarks demonstrate that our method maintains or even improves accuracy while reducing chain length by 35–52% on average, yielding substantial reductions in computational cost and response latency. This work introduces a new paradigm for efficient and interpretable LRM inference.
📝 Abstract
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens through a discounted reinforcement learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments corroborate the theory: the approach shortens chains of thought while preserving accuracy.
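The abstract's claim that discounting is "interpretable as a small per-token cost" follows from a standard first-order expansion (this derivation is a sketch, not taken from the paper). If a terminal reward $r$ arrives after $T$ reasoning tokens and is discounted by $\gamma^T$, then for $\gamma$ close to $1$:

$$
\gamma^T r \;=\; r\, e^{T \ln \gamma} \;\approx\; r\bigl(1 - T(1-\gamma)\bigr) \;=\; r - c\,T,
\qquad c := r\,(1-\gamma),
$$

so discounting by $\gamma$ acts, to first order, like subtracting a linear per-token cost $c$ from the undiscounted reward, which is exactly the token-level penalty described in the method.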