🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models incurs excessive token consumption and latency due to inherent redundancy, yet the theoretical origins of this redundancy, particularly under reinforcement learning (RL), remain unexamined. Method: This work first establishes theoretically that standard RL optimization inherently induces redundant CoT generation, and further identifies an intrinsic positive correlation between CoT conciseness and reasoning accuracy. Building on these insights, we propose a lightweight, two-stage RL-based pruning paradigm that avoids full retraining, integrating Proximal Policy Optimization (PPO), few-shot reward modeling, and CoT distillation. Results: Across multiple mathematical and logical reasoning benchmarks, our method reduces CoT length by 42% on average while maintaining accuracy or improving it by up to 1.3%, significantly lowering computational cost and inference latency.
📝 Abstract
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding challenges the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly shorten a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.
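To make the secondary RL phase concrete, one way such training could steer a policy toward shorter chains of thought is a reward that combines correctness with a length bonus. The sketch below is a minimal illustration under assumed conventions; the function name, the linear penalty form, and the `budget`/`alpha` parameters are hypothetical, not the paper's exact formulation.

```python
def conciseness_reward(correct: bool, cot_tokens: int,
                       budget: int = 512, alpha: float = 0.5) -> float:
    """Hypothetical RL reward favoring short, correct chains of thought.

    correct:    whether the final answer matches the reference
    cot_tokens: length of the generated chain of thought, in tokens
    budget:     soft token budget; longer responses earn no length bonus
    alpha:      weight of the conciseness bonus relative to correctness
    """
    base = 1.0 if correct else 0.0
    # Linear bonus that shrinks as the CoT approaches the budget, clipped at 0.
    length_bonus = alpha * max(0.0, 1.0 - cot_tokens / budget)
    # Only reward brevity when the answer is actually correct, so the
    # policy cannot trade accuracy for shortness.
    return base + (length_bonus if correct else 0.0)

# A short correct answer outscores a long correct one, and any correct
# answer outranks an incorrect one.
assert conciseness_reward(True, 128) > conciseness_reward(True, 500)
assert conciseness_reward(True, 500) > conciseness_reward(False, 64)
```

Gating the length bonus on correctness is the key design choice: it preserves the accuracy signal while letting the optimizer prune redundant reasoning steps, consistent with the conciseness-accuracy correlation the paper describes.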