WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations of existing Group-Relative Policy Optimization (GRPO) methods in complex reasoning tasks, where reliance on group-wide relative rewards often leads to over-reasoning, low sample efficiency, and difficulty balancing reasoning length against accuracy. To overcome these issues, the authors propose Weakly Supervised GRPO (WS-GRPO), which, for the first time, leverages only the correctness label of the final answer to construct a prefix-level preference model that dynamically guides whether to continue or terminate the reasoning process. By avoiding the calibration challenges inherent in global length penalties, WS-GRPO significantly shortens reasoning trajectories across multiple benchmarks while maintaining accuracy comparable to standard GRPO, substantially improving sampling efficiency.

📝 Abstract
Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice because (i) length penalties are hard to calibrate: longer rollouts may reflect harder problems that require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final-answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties, which are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. WS-GRPO thus supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and show empirically on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.
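The core idea of the abstract — deriving prefix-level continue/stop signals purely from final-answer correctness — can be illustrated with a toy sketch. This is not the paper's actual preference model (which is learned and integrated into GRPO training); the names `prefix_continue_signal` and `should_continue`, the empirical-frequency estimator, and the threshold rule are all illustrative assumptions.

```python
from collections import defaultdict

def prefix_continue_signal(trajectories):
    """Toy stand-in for a weakly supervised prefix preference model.

    trajectories: list of (steps, correct) pairs, where `steps` is a
    sequence of reasoning steps and `correct` is the outcome-only
    final-answer label. Returns a dict mapping each observed prefix
    (as a tuple of steps) to the empirical probability that sampled
    continuations from that prefix end in a correct answer.
    """
    hits = defaultdict(int)    # correct outcomes seen past this prefix
    total = defaultdict(int)   # all outcomes seen past this prefix
    for steps, correct in trajectories:
        for t in range(len(steps)):
            prefix = tuple(steps[:t + 1])
            total[prefix] += 1
            hits[prefix] += int(correct)
    return {p: hits[p] / total[p] for p in total}

def should_continue(signal, prefix, threshold=0.5):
    """Continue reasoning only when the outcome-derived signal suggests
    trajectories sharing this prefix succeed often enough; unseen
    prefixes default to continuing."""
    return signal.get(tuple(prefix), 1.0) >= threshold
```

A usage sketch: with rollouts `[(["a", "b"], True), (["a", "b", "c"], False), (["a"], True)]`, the prefix `("a", "b", "c")` scores 0.0, so `should_continue` signals stopping there, while shorter prefixes still score at or above 0.5. The paper's method presumably replaces this counting estimator with a trained preference model and folds the signal into the GRPO objective rather than a hard stop rule.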
Problem

Research questions and friction points this paper is trying to address.

reasoning efficiency
overthinking
length control
weak supervision
rollout optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly-Supervised Learning
Group-Relative Policy Optimization
Rollout Efficiency
Preference Modeling
Reasoning Trajectories