🤖 AI Summary
Large language models often struggle to optimize intermediate reasoning steps due to sparse rewards in multi-step reasoning tasks. This work proposes PRPO, a novel method that effectively integrates outcome-based rewards with fine-grained feedback from a process reward model (PRM) within an actor-only framework—eliminating the need for a critic network. PRPO segments reasoning sequences into semantic logical units, normalizes PRM scores into token-level advantages, and introduces a positional shift parameter to align the distribution of process advantages with that of outcome advantages. This alignment prevents policy collapse and enables efficient credit assignment. Evaluated on MATH500, PRPO improves the accuracy of Qwen2.5-Math-1.5B from 61.2% to 64.4% using only eight rollouts per problem, outperforming GRPO without requiring a value network.
📝 Abstract
Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning. While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic cues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through a location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% (GRPO) to 64.4% using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization. Code is available at: https://github.com/SchumiDing/srpocode
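To make the abstract's pipeline concrete, here is a minimal sketch of the advantage computation it describes: GRPO-style group normalization of outcome rewards, per-sequence normalization of PRM step scores, and a location-parameter (mean) shift that aligns each rollout's process advantages with its outcome advantage. Function names, the choice of mean shift, and the step-level (rather than token-level) granularity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def prpo_advantages(outcome_rewards, prm_scores_per_rollout, eps=1e-8):
    """Illustrative PRPO-style advantage blending (not the official code).

    outcome_rewards: one scalar reward per rollout in the group.
    prm_scores_per_rollout: list of PRM step scores for each rollout,
        one score per semantic segment of the reasoning chain.
    """
    # GRPO-style outcome advantage: normalize rewards across the group.
    r = np.asarray(outcome_rewards, dtype=float)
    outcome_adv = (r - r.mean()) / (r.std() + eps)

    blended = []
    for adv_o, scores in zip(outcome_adv, prm_scores_per_rollout):
        s = np.asarray(scores, dtype=float)
        # Normalize step-level PRM scores within the sequence
        # (zero mean, unit variance).
        proc_adv = (s - s.mean()) / (s.std() + eps)
        # Location-parameter shift (assumed form): move the process
        # advantages so their mean matches this rollout's outcome
        # advantage, keeping their relative step-to-step structure.
        proc_adv = proc_adv + adv_o
        # In PRPO each step's advantage would then be broadcast to the
        # tokens of that segment; we stop at step level for brevity.
        blended.append(proc_adv)
    return outcome_adv, blended
```

Because the normalized process advantages are zero-mean before the shift, each rollout's shifted advantages average exactly to its outcome advantage, which is one simple way the two signals' distributions can be aligned to avoid the collapse the abstract warns about.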