🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models incurs excessive token consumption and latency due to inherent redundancy, yet the theoretical origins of this redundancy, particularly under reinforcement learning (RL), remain unexamined. Method: This work first establishes theoretically that standard RL optimization inherently induces redundant CoT generation, and further identifies an intrinsic positive correlation between CoT conciseness and reasoning accuracy. Building on these insights, we propose a lightweight, two-stage RL-based pruning paradigm that avoids full retraining, integrating Proximal Policy Optimization (PPO), few-shot reward modeling, and CoT distillation. Results: Across multiple mathematical and logical reasoning benchmarks, our method reduces CoT length by 42% on average while maintaining accuracy or improving it by up to 1.3%, significantly lowering computational cost and inference latency.
📝 Abstract
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding challenges the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly shorten a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.
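To make the secondary RL phase concrete, one way such training could steer a policy toward shorter chains of thought is a reward that combines correctness with a length bonus. The sketch below is a minimal illustration under assumed conventions; the function name, the linear penalty form, and the `budget`/`alpha` parameters are hypothetical, not the paper's exact formulation.

```python
def conciseness_reward(correct: bool, cot_tokens: int,
                       budget: int = 512, alpha: float = 0.5) -> float:
    """Hypothetical RL reward favoring short, correct chains of thought.

    correct:    whether the final answer matches the reference
    cot_tokens: length of the generated chain of thought, in tokens
    budget:     soft token budget; longer responses earn no length bonus
    alpha:      weight of the conciseness bonus relative to correctness
    """
    base = 1.0 if correct else 0.0
    # Linear bonus that shrinks as the CoT approaches the budget, clipped at 0.
    length_bonus = alpha * max(0.0, 1.0 - cot_tokens / budget)
    # Only reward brevity when the answer is actually correct, so the
    # policy cannot trade accuracy for shortness.
    return base + (length_bonus if correct else 0.0)

# A short correct answer outscores a long correct one, and any correct
# answer outranks an incorrect one.
assert conciseness_reward(True, 128) > conciseness_reward(True, 500)
assert conciseness_reward(True, 500) > conciseness_reward(False, 64)
```

Gating the length bonus on correctness is the key design choice: it preserves the accuracy signal while letting the optimizer prune redundant reasoning steps, consistent with the conciseness-accuracy correlation the paper describes.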