AI Summary
This work addresses the inefficiency of large language models that often generate redundant reasoning chains, incurring unnecessary computational overhead without improving accuracy. We propose the first framework that explicitly treats reasoning length as a step-level optimization objective within reinforcement learning. Our approach applies adaptive length penalties by estimating each reasoning step's contribution to the final answer through policy-intrinsic log-probability scores. By integrating Group Relative Policy Optimization with a unified outcome-and-process advantage function, the method compresses low-importance steps while preserving critical reasoning components. Experimental results demonstrate that our approach reduces average reasoning length by 64.3% while simultaneously improving accuracy by 5.7% over baseline methods.
Abstract
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
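The step-importance scoring and penalty redistribution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the inverse-importance weighting rule, and the exact form of the log-probability importance signal are all simplifying assumptions made here for clarity.

```python
# Hypothetical sketch of SWAP-style adaptive length penalties, assuming:
# - step_logprobs[i] is the policy's log-probability of the correct answer
#   after reasoning step i (step_logprobs[0] is the value before any step);
# - step_lengths[i] is the token length of step i;
# - budget is a target total reasoning length.
# The concrete redistribution rule below (inverse-importance weighting)
# is illustrative, not the paper's published formula.

def step_importance(step_logprobs):
    """Importance of a step = its on-policy log-probability improvement
    toward the correct answer (clipped at zero for harmful steps)."""
    return [max(step_logprobs[i + 1] - step_logprobs[i], 0.0)
            for i in range(len(step_logprobs) - 1)]

def adaptive_length_penalties(step_logprobs, step_lengths, budget):
    """Treat excess length as a penalty mass and redistribute it so that
    low-importance steps absorb a heavier share of the penalty."""
    excess = max(sum(step_lengths) - budget, 0.0)
    if excess == 0.0:
        return [0.0] * len(step_lengths)  # within budget: no penalty
    importances = step_importance(step_logprobs)
    # Inverse-importance weights: less useful steps get larger weights.
    weights = [1.0 / (1e-6 + s) for s in importances]
    total = sum(weights)
    return [excess * w / total for w in weights]
```

In this sketch the penalties sum exactly to the excess length, so compression pressure is conserved but concentrated on the steps that contributed least to reaching the correct answer; a step-level penalty of this kind can then be folded into a unified outcome-process advantage inside group-relative policy optimization.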