Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of large reasoning models in long-chain reasoning, where redundant computation and overthinking often degrade performance. Existing approaches, such as length penalties during training or early-exit mechanisms during inference, frequently compromise accuracy or introduce additional overhead. To overcome these limitations, the authors propose Step-GRPO, a post-training framework that internalizes dynamic early-exit capability directly into the model. By treating semantic reasoning steps rather than individual tokens as the fundamental unit of optimization, Step-GRPO integrates dynamic truncation with backtracking and a step-aware relative reward mechanism, enabling efficient reinforcement learning during post-training. Evaluated on three model sizes across multiple benchmarks, the method achieves a superior accuracy-efficiency trade-off; on Qwen3-8B it reduces token consumption by 32.0% relative to the vanilla model while maintaining or even improving reasoning accuracy.

📝 Abstract

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
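
The abstract does not give the reward formula, so the following is only an illustrative sketch of how a step-level, group-relative penalty could look, not the authors' implementation. The marker list `STEP_MARKERS`, the functions `split_steps`, `step_aware_rewards`, and `group_advantages`, the `penalty` weight, and the choice of the mean step count of correct rollouts as the group baseline are all assumptions made for this example. The final standardization follows the general group-relative style of GRPO.

```python
import re
from statistics import mean, pstdev

# Hypothetical linguistic markers; the source does not enumerate the ones it uses.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm", "So", "Therefore")


def split_steps(text):
    """Split a reasoning trace into steps, cutting just before each marker word."""
    pattern = r"(?=\b(?:%s)\b)" % "|".join(STEP_MARKERS)
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def step_aware_rewards(correct, step_counts, penalty=0.1):
    """Reward correct rollouts, discounting those that use more steps than the group.

    Assumption: the baseline is the mean step count over correct rollouts
    in the same group; wrong answers receive zero reward.
    """
    correct_counts = [n for ok, n in zip(correct, step_counts) if ok]
    if not correct_counts:
        return [0.0] * len(correct)
    baseline = mean(correct_counts)
    rewards = []
    for ok, n in zip(correct, step_counts):
        if not ok:
            rewards.append(0.0)
            continue
        # Only steps beyond the group baseline are penalized.
        excess = max(0.0, (n - baseline) / baseline)
        rewards.append(1.0 - penalty * excess)
    return rewards


def group_advantages(rewards):
    """Standardize rewards within the group (population std used here for simplicity)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


trace = "First compute 2+2=4. Wait, check again: 4. So the answer is 4."
steps = split_steps(trace)

correct = [True, True, False, True]
step_counts = [2, 4, 3, 6]
rewards = step_aware_rewards(correct, step_counts)
advantages = group_advantages(rewards)
```

In this toy group, the correct rollout with six steps exceeds the baseline of four and so receives a slightly lower reward than the shorter correct rollouts, while the incorrect rollout receives the lowest advantage regardless of its length.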

Problem

Research questions and friction points this paper is trying to address.

early-exit

reasoning efficiency

overthinking

token consumption

redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-GRPO

dynamic early-exit

semantic reasoning steps