AI Summary
This work addresses the inefficiency of large language models that often generate redundant reasoning chains, incurring unnecessary computational overhead without improving accuracy. We propose the first framework that explicitly treats reasoning length as a step-level optimization objective within reinforcement learning. Our approach applies adaptive length penalties by estimating each reasoning step's contribution to the final answer through policy-intrinsic log-probability scores. By integrating Group Relative Policy Optimization with a unified outcome-and-process advantage function, the method compresses low-importance steps while preserving critical reasoning components. Experimental results demonstrate that our approach reduces average reasoning length by 64.3% while simultaneously improving accuracy by 5.7% over baseline methods.
Abstract
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
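The step-importance scoring and penalty redistribution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the inverse-importance weighting rule, and the exact form of the log-probability importance signal are all simplifying assumptions made here for clarity.

```python
# Hypothetical sketch of SWAP-style adaptive length penalties, assuming:
# - step_logprobs[i] is the policy's log-probability of the correct answer
#   after reasoning step i (step_logprobs[0] is the value before any step);
# - step_lengths[i] is the token length of step i;
# - budget is a target total reasoning length.
# The concrete redistribution rule below (inverse-importance weighting)
# is illustrative, not the paper's published formula.

def step_importance(step_logprobs):
    """Importance of a step = its on-policy log-probability improvement
    toward the correct answer (clipped at zero for harmful steps)."""
    return [max(step_logprobs[i + 1] - step_logprobs[i], 0.0)
            for i in range(len(step_logprobs) - 1)]

def adaptive_length_penalties(step_logprobs, step_lengths, budget):
    """Treat excess length as a penalty mass and redistribute it so that
    low-importance steps absorb a heavier share of the penalty."""
    excess = max(sum(step_lengths) - budget, 0.0)
    if excess == 0.0:
        return [0.0] * len(step_lengths)  # within budget: no penalty
    importances = step_importance(step_logprobs)
    # Inverse-importance weights: less useful steps get larger weights.
    weights = [1.0 / (1e-6 + s) for s in importances]
    total = sum(weights)
    return [excess * w / total for w in weights]
```

In this sketch the penalties sum exactly to the excess length, so compression pressure is conserved but concentrated on the steps that contributed least to reaching the correct answer; a step-level penalty of this kind can then be folded into a unified outcome-process advantage inside group-relative policy optimization.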