AI Summary
This work addresses a limitation of standard outcome-based supervision in Reinforcement Learning from Verifiable Rewards (RLVR): it penalizes partially correct but ultimately failed reasoning trajectories as heavily as completely wrong ones, so valuable explorations are discarded prematurely. To remedy this, the authors propose the SCOPE framework, which employs a process reward model to identify the first erroneous step in suboptimal trajectories and applies fine-grained off-policy correction while maintaining on-policy step-level updates, thereby avoiding out-of-distribution data. SCOPE effectively recovers partially correct reasoning paths, increasing the diversity score by 13.5% and improving generalization. The authors report state-of-the-art results, with an average accuracy of 46.6% on mathematical reasoning tasks and 53.4% on out-of-distribution reasoning tasks.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, leading to a degradation in rollout diversity that prematurely narrows the exploration space. Although Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods introduce off-policy guidance through whole-trajectory replacement, which often lies outside the policy model's distribution, and they still fail to utilize the largely correct rollouts generated by the model itself; as a result, they do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages these trajectories and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
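The core idea of locating the first erroneous step and correcting only from that point can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 0.5 score threshold, the `corrector` callable, and the choice to drop the steps after the flagged one are all hypothetical assumptions, since the abstract does not specify these details.

```python
from typing import Callable, List, Optional, Sequence


def first_error_index(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`, or None.

    `step_scores` stands in for per-step outputs of a process reward model;
    the threshold value is an assumption made for illustration.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def rectify_rollout(
    steps: Sequence[str],
    step_scores: Sequence[float],
    corrector: Callable[[List[str], str], str],
    threshold: float = 0.5,
) -> List[str]:
    """Keep the on-policy prefix and replace only the first flagged step.

    `corrector` is a hypothetical off-policy source (for example, a stronger
    model) that receives the kept prefix and the faulty step. Steps after the
    flagged one are dropped here; how the original method continues from the
    corrected step is not stated in the abstract.
    """
    idx = first_error_index(step_scores, threshold)
    if idx is None:
        return list(steps)  # no step flagged, rollout is left unchanged
    prefix = list(steps[:idx])
    return prefix + [corrector(prefix, steps[idx])]
```

In contrast to whole-trajectory replacement, a scheme of this shape keeps every step the policy produced before the first flagged error, which is the property the abstract credits for preserving rollout diversity.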