GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

📅 2026-04-22

🤖 AI Summary
This work addresses a limitation of Group Relative Policy Optimization (GRPO) in reinforcement learning: it lacks fine-grained credit assignment over intermediate reasoning steps, which often leads to inefficient or excessive reasoning. The authors propose a model-free, verifiable process supervision mechanism that estimates the model's conditional probability of producing the correct answer at segment boundaries within a reasoning trajectory. This yields a segment-level progress metric that refines GRPO's trajectory-level feedback into more targeted policy signals. Because the supervision comes from conditional probability tracking, it requires neither Monte Carlo rollouts nor auxiliary models, which the authors report makes policy updates more targeted and sample-efficient. Experiments show consistent improvements over standard GRPO across models: on mathematical reasoning benchmarks, accuracy increases by up to 2.6 points with reasoning chains up to 13.7% shorter; on general-domain tasks, accuracy improves by up to 2.4 points with reasoning length reduced by up to 4%.
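To make the credit-assignment issue concrete, the following is a minimal sketch of the trajectory-level signal in standard GRPO: outcome rewards are standardized within a group of sampled responses, and every token of a response then receives the same advantage. The function names, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Trajectory-level credit: every token gets the same advantage."""
    return [advantage] * num_tokens

# Four sampled responses to one prompt; reward 1.0 means a verified correct answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# All five tokens of the first response share one value, whether or not
# a given step actually moved the model toward the answer.
token_advs = broadcast_to_tokens(advs[0], 5)
```

Under this scheme a redundant step inside a correct response is rewarded exactly as much as the step that solved the problem, which is the gap that segment-level supervision is meant to close.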

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and encourages overthinking. In this work, we introduce a model-free, verifiable form of process supervision that probes the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
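The probing idea in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: segments are split on blank lines, progress is taken as the difference in answer probability between consecutive boundaries, and `toy_answer_logprob` is a hypothetical stand-in for a forward pass of the policy model that scores the correct answer appended to a prefix. The abstract does not specify the segmentation rule or the exact progress formula.

```python
import math
from typing import Callable, List, Tuple

def split_segments(response: str) -> List[str]:
    """Assumption: reasoning segments are separated by blank lines."""
    return [s for s in response.split("\n\n") if s.strip()]

def segment_progress(
    question: str,
    response: str,
    answer: str,
    answer_logprob: Callable[[str, str], float],
) -> Tuple[List[float], List[float]]:
    """Probe P(answer | prefix) at each segment boundary.

    Returns the probabilities (including the one before any reasoning)
    and the per-segment progress, assumed here to be the change in
    probability across each segment.
    """
    prefix = question
    probs = [math.exp(answer_logprob(prefix, answer))]
    for seg in split_segments(response):
        prefix = prefix + "\n\n" + seg
        probs.append(math.exp(answer_logprob(prefix, answer)))
    progress = [after - before for before, after in zip(probs, probs[1:])]
    return probs, progress

def toy_answer_logprob(prefix: str, answer: str) -> float:
    """Hypothetical scorer: confident only once the answer appears in the prefix."""
    return math.log(0.9 if answer in prefix else 0.1)

question = "What is 6 * 7?"
response = (
    "Let me restate the problem.\n\n"
    "6 * 7 = 42.\n\n"
    "Double-checking the result."
)
probs, progress = segment_progress(question, response, "42", toy_answer_logprob)
```

In this toy run only the second segment changes the probed probability, so it alone would receive positive progress, while the restatement and the redundant check would receive roughly zero. With a real model, the scorer would be one teacher-forced forward pass per boundary rather than a sampled rollout, which is consistent with the abstract's claim of avoiding Monte Carlo estimation.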
Problem

Research questions and friction points this paper is trying to address.

credit assignment
intermediate steps
reasoning strategies
overthinking
process supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifiable Process Supervision
Group Relative Policy Optimization
Reinforcement Learning with Verifiable Rewards
Conditional Probability Probing
Sample-Efficient Policy Update