🤖 AI Summary
Large reasoning models (LRMs) suffer from excessive token consumption and high latency caused by unnecessarily long reasoning chains, yet the common assumption that “longer chains imply higher accuracy” lacks rigorous empirical or theoretical support.
Method: We propose a discounted reinforcement learning (DRL) framework that explicitly incorporates a small per-token cost into policy training, thereby optimizing for both accuracy and reasoning efficiency. Leveraging Blackwell optimality theory under restricted policy classes, we formally establish that concise reasoning is optimal whenever accuracy is preserved. Our approach requires no supervision, labels, or human annotations—chain compression is driven solely by token-level penalties.
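To make the per-token cost concrete, here is a minimal sketch of the kind of reward shaping such a framework could use before a standard policy-gradient update. This is an illustration, not the paper's implementation; the function name, the 0/1 accuracy reward, and the `token_cost` value are all hypothetical.

```python
def shaped_reward(is_correct: bool, num_reasoning_tokens: int,
                  token_cost: float = 1e-4) -> float:
    """Accuracy reward minus a small per-token penalty (illustrative only).

    Under a discount factor close to 1, each extra reasoning token slightly
    reduces the discounted return, so the trained policy is pushed toward
    the shortest chain that still answers correctly.
    """
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - token_cost * num_reasoning_tokens

# A longer correct chain earns strictly less than a shorter correct one,
# while incorrect answers are penalized regardless of length.
short_correct = shaped_reward(True, 200)
long_correct = shaped_reward(True, 800)
wrong = shaped_reward(False, 200)
print(short_correct, long_correct, wrong)
```

Because the penalty is computed from token counts alone, no labels or annotations of chain quality are needed, matching the supervision-free claim above.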
Contribution/Results: Experiments across multiple reasoning benchmarks demonstrate that our method maintains or even improves accuracy while reducing chain length by 35–52% on average, yielding substantial reductions in computational cost and response latency. This work introduces a new paradigm for efficient and interpretable LRM inference.
📝 Abstract
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens through a discounted reinforcement learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments corroborate the theory: the approach shortens chains of thought while preserving accuracy.
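The abstract's claim that discounting is "interpretable as a small per-token cost" follows from a standard first-order expansion (this derivation is a sketch, not taken from the paper). If a terminal reward $r$ arrives after $T$ reasoning tokens and is discounted by $\gamma^T$, then for $\gamma$ close to $1$:

$$
\gamma^T r \;=\; r\, e^{T \ln \gamma} \;\approx\; r\bigl(1 - T(1-\gamma)\bigr) \;=\; r - c\,T,
\qquad c := r\,(1-\gamma),
$$

so discounting by $\gamma$ acts, to first order, like subtracting a linear per-token cost $c$ from the undiscounted reward, which is exactly the token-level penalty described in the method.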