Mitigating Overthinking through Reasoning Shaping

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) trained via reinforcement learning with verifier rewards (RLVR) suffer from "overthinking": they generate lengthy, inefficient reasoning traces that increase computational overhead and can degrade performance. Existing token-level penalty methods impair accuracy because their supervision is too coarse-grained. To address this, the authors propose **Group Relative Segment Penalization (GRSP)**, a fine-grained, segment-level regularization method: it clusters reasoning steps into semantically coherent segments and applies length-aware, dynamically weighted relative penalties at the segment level to encourage structured reasoning. GRSP also improves training stability and scales across model sizes. Experiments show that GRSP reduces inference cost by 32%–47% on complex tasks while preserving or even improving accuracy, which the authors present as the first RLVR framework to jointly achieve high efficiency and reliability.
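The paper's exact reward shaping is not reproduced on this page, but the mechanism summarized above can be sketched. The Python snippet below is a minimal, hypothetical illustration of group-relative segment penalization: the blank-line segmentation in `split_into_segments`, the penalty formula, and the scale `lam` are all assumptions for illustration, not the authors' implementation.

```python
import re
from statistics import mean

def split_into_segments(trace: str) -> list[str]:
    """Stand-in segmentation: split a reasoning trace on blank lines.
    GRSP clusters steps into semantically coherent segments; a real
    implementation would use embeddings or learned boundaries."""
    return [s for s in re.split(r"\n\s*\n", trace) if s.strip()]

def group_relative_segment_penalties(traces: list[str]) -> list[float]:
    """Penalize each rollout by how many segments it spends relative
    to the group average, so only comparatively verbose traces pay."""
    counts = [len(split_into_segments(t)) for t in traces]
    avg = mean(counts)
    return [max(0.0, (c - avg) / max(avg, 1.0)) for c in counts]

def shaped_rewards(verifier_rewards: list[float], traces: list[str],
                   lam: float = 0.1) -> list[float]:
    """Combine the verifier reward with the relative segment penalty."""
    penalties = group_relative_segment_penalties(traces)
    return [r - lam * p for r, p in zip(verifier_rewards, penalties)]
```

Under this sketch, rollouts in a group that all pass the verifier still receive different shaped rewards, with the most verbose trace penalized hardest; a trace at or below the group-average segment count pays no penalty.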

📝 Abstract
Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often overthink: they produce excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR reduce token consumption but often harm model performance, a failure that stems from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with its advantages especially pronounced on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
Problem

Research questions and friction points this paper is trying to address.

Mitigating overthinking in large reasoning models
Reducing excessive reasoning while maintaining accuracy
Balancing computational efficiency with model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-level penalization method regularizes reasoning segments
Length-aware weighting mechanism across segment clusters (see the sketch after this list)
Stabilizes training and scales across model sizes
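The length-aware weighting named above is not specified on this page; the sketch below assumes a softmax over segment-cluster lengths, so that compression pressure concentrates on the longest clusters. The temperature `tau` and the weighted aggregation are illustrative guesses, not the paper's formula.

```python
import math

def length_aware_weights(cluster_lengths: list[int],
                         tau: float = 100.0) -> list[float]:
    """Softmax over cluster lengths (in tokens): longer segment
    clusters receive larger weights, so they absorb most of the
    penalty while short clusters are left largely untouched."""
    exps = [math.exp(length / tau) for length in cluster_lengths]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_penalty(cluster_lengths: list[int],
                     cluster_penalties: list[float]) -> float:
    """Aggregate per-cluster penalties with length-aware weights."""
    weights = length_aware_weights(cluster_lengths)
    return sum(w * p for w, p in zip(weights, cluster_penalties))

# Example: the 400-token cluster dominates the aggregate penalty.
print(weighted_penalty([40, 120, 400], [0.1, 0.2, 0.5]))
```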
👥 Authors
Feifan Song (Peking University; Natural Language Processing)
Shaohang Wei (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Bofei Gao (Peking University; Natural Language Processing)
Yejie Wang (Beijing University of Posts and Telecommunications; Natural Language Processing)
Wen Luo (Peking University)
Wei Li (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Linli Yao (Peking University; multi-modal semantic understanding)
Weimin Xiong (Peking University; Computer Science)
Liang Chen (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Tianyu Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Houfeng Wang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)