Concise Reasoning via Reinforcement Learning

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models incurs excessive token consumption and latency due to inherent redundancy, yet the theoretical origins of this redundancy, particularly under reinforcement learning (RL), remain unexamined. Method: This work first establishes theoretically that standard RL optimization inherently induces redundant CoT generation, and further identifies an intrinsic positive correlation between CoT conciseness and reasoning accuracy. Building on these insights, it proposes a lightweight two-stage RL-based pruning paradigm that avoids full retraining, integrating Proximal Policy Optimization (PPO), few-shot reward modeling, and CoT distillation. Results: Across multiple mathematical and logical reasoning benchmarks, the method achieves an average 42% reduction in CoT length while maintaining accuracy, and in some cases improving it by up to 1.3%, substantially lowering computational cost and inference latency.

📝 Abstract
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.
Problem

Research questions and friction points this paper is trying to address.

Reducing token usage in reasoning LLMs to cut computational cost and response time
Challenging the assumption that longer responses yield better reasoning accuracy
Shortening chains of thought via RL post-training without sacrificing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mathematical analysis showing that RL-based optimization inherently induces lengthy responses
Evidence of a natural, largely overlooked correlation between conciseness and accuracy
A secondary RL post-training phase, using few problems and limited resources, that shortens chains of thought while maintaining or improving accuracy