🤖 AI Summary
This work addresses two problems in training with verifiable rewards: sparse rewards from binary verifiers make training inefficient and unstable, and existing stabilization methods penalize every deviation from a reference model uniformly, regardless of whether the deviation is an improvement. To overcome these issues, we propose One-Way Policy Optimization (OWPO), which decouples update direction from update magnitude: the verifier determines the optimization direction, while the reference policy only modulates the step size. OWPO further introduces an asymmetric reweighting mechanism that accelerates alignment where the policy is worse than the reference and locks in gains where it is better. Combined with iterative reference updates, this produces a ratchet effect that continuously consolidates performance improvements. Experimental results show that OWPO outperforms strong baselines such as DAPO, OPD, and MOPD across multiple benchmarks, moving past the bottleneck of a fixed prior and sustaining self-evolution without reliance on external reference models.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling the reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.
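
The decoupling principle in the abstract can be illustrated with a small sketch. The abstract does not give OWPO's actual weighting formula, so everything below is an assumption made for illustration: the function names, the linear acceleration term `alpha`, the constant `lock_weight`, and the use of a simple log-ratio penalty as a stand-in for a token-level reference constraint. The sketch only shows why a symmetric penalty can flip the sign of a verifier-determined update, and how a strictly positive, asymmetric weight avoids that.

```python
def penalized_coefficient(advantage, logp, logp_ref, beta=1.0):
    """Stand-in for a symmetric token-level constraint (assumed form):
    the log-ratio to the reference is subtracted from the verifier advantage,
    so a large enough deviation can change the sign of the update."""
    return advantage - beta * (logp - logp_ref)


def one_way_coefficient(advantage, logp, logp_ref, alpha=1.0, lock_weight=1.0):
    """Hypothetical asymmetric reweighting (not the paper's formula).

    The sign of `advantage` (from the verifier) fixes the direction.
    The reference only scales the step with a strictly positive weight.
    """
    if advantage == 0.0:
        return 0.0
    direction = 1.0 if advantage > 0 else -1.0
    # gap > 0: the policy lags the reference along the verifier's direction.
    gap = direction * (logp_ref - logp)
    if gap > 0:
        weight = 1.0 + alpha * gap   # "accelerated alignment" (assumed linear)
    else:
        weight = lock_weight         # "gain locking" (assumed constant, > 0)
    return advantage * weight


# Token from a verified-correct response where the policy already assigns
# more probability than the reference (a superior deviation).
ahead = dict(advantage=1.0, logp=-0.5, logp_ref=-2.0)
print(penalized_coefficient(**ahead))  # -0.5: the penalty flips the direction
print(one_way_coefficient(**ahead))    # 1.0: direction preserved

# Same token, but the policy lags the reference (an inferior deviation).
behind = dict(advantage=1.0, logp=-2.0, logp_ref=-0.5)
print(one_way_coefficient(**behind))   # 2.5: larger step toward the reference
```

Under these assumptions the one-way coefficient always shares the sign of the verifier advantage, which is the property the abstract attributes to OWPO; the specific weight shapes would need to be taken from the paper itself.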