🤖 AI Summary
Reasoning large language models (RLLMs) often generate excessively long reasoning chains even for simple questions, a behavior termed "overthinking" that wastes tokens and degrades inference efficiency. To address this, we propose Dynamic Reasoning Quota Allocation (DRQA), which, for the first time, transfers the implicit resource-competition mechanism observed in batched inference to single-question reasoning. DRQA constructs preference data from batch-generated responses and applies reinforcement learning to train the model to trade off accuracy against reasoning length adaptively, producing concise answers for easy questions while preserving sufficient reasoning depth for hard ones. Evaluated across multiple mathematical and scientific reasoning benchmarks, our method reduces average reasoning-token consumption by up to 42% while maintaining or improving accuracy. This effectively mitigates overthinking, enhances inference efficiency, and supports more scalable deployment.
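The preference-data construction mentioned above can be sketched as follows. This is an illustrative assumption about how "accurate and concise" pairs might be selected from batch-generated candidates, not the paper's released implementation; the function name and fields are hypothetical.

```python
# Illustrative sketch (assumption): from several sampled responses to one
# question, build a (chosen, rejected) pair that rewards responses which
# are both correct and concise.

def build_preference_pair(candidates):
    """candidates: list of dicts with keys 'text', 'correct' (bool),
    'tokens' (int). Returns (chosen, rejected) or None if no pair exists."""
    correct = [c for c in candidates if c["correct"]]
    if not correct:
        return None  # no correct response, so no usable preference signal
    # Prefer the shortest correct answer...
    chosen = min(correct, key=lambda c: c["tokens"])
    # ...and reject an incorrect answer if one exists, else the longest one.
    incorrect = [c for c in candidates if not c["correct"]]
    rejected = max(incorrect or candidates, key=lambda c: c["tokens"])
    if rejected is chosen:
        return None  # only one candidate survived; skip this question
    return chosen, rejected
```

Pairs of this form could then feed a standard preference-learning objective, so the model internalizes a bias toward short-but-correct reasoning.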
📝 Abstract
Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.
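The accuracy/conciseness trade-off that DRQA's reinforcement-learning stage encourages can be sketched with a simple length-penalized reward. The functional form and the coefficient `alpha` below are assumptions for illustration, not the paper's actual objective.

```python
# Illustrative sketch (assumption): a scalar reward that grants full credit
# for a correct answer and subtracts a penalty proportional to the fraction
# of the token budget the reasoning chain consumed.

def reward(correct, n_tokens, max_tokens=4096, alpha=0.5):
    """`alpha` (assumed) sets how strongly brevity is rewarded relative
    to accuracy; `max_tokens` is an assumed generation budget."""
    accuracy_term = 1.0 if correct else 0.0
    length_term = alpha * min(n_tokens / max_tokens, 1.0)
    # An incorrect answer earns no accuracy credit, so extra reasoning
    # tokens can only lower its reward further.
    return accuracy_term - length_term
```

Under a reward like this, a short correct answer dominates a long correct one, while hard questions can still justify long chains whenever the extra tokens flip the answer from wrong to right.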