AI Summary
This work addresses the instability and sample inefficiency of GRPO-style algorithms under sparse verifiable rewards, where early training updates are dominated by negative-advantage samples and per-response length normalization ties the update magnitude to output length. To mitigate these issues, the authors propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that downweights negative-advantage updates and replaces per-response length normalization with mean-length normalization. They further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight from the sign distribution of advantages within each batch, removing the need to tune a fixed weight by hand. On the TeleLogs and Countdown tasks, A-HPO outperforms GRPO, SAPO, and GSPO. On TeleLogs it reaches a final reward of 0.84, an improvement of 5% over SAPO, 11% over GSPO, and 15% over GRPO, and on Countdown the largest gains appear in the initial and most difficult configurations across 1.5B to 7B models.
Abstract
We investigate a narrow but common failure mode of GRPO-style reinforcement learning under sparse verifiable rewards: early updates contain more responses with negative advantages than with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need to tune a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early, sparse-reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B to 7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
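The two modifications described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names, the group-standardized advantage, and especially the adaptive rule (the ratio of positive to negative advantage counts, capped at 1) are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0.0:
        # All responses scored the same, so there is no learning signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std


def adaptive_beta(advantages):
    """Hypothetical adaptive rule (an assumption, not taken from the paper):
    shrink negative updates in proportion to how much they outnumber
    positive ones, and never upweight them beyond 1."""
    advantages = np.asarray(advantages, dtype=float)
    n_pos = int((advantages > 0).sum())
    n_neg = int((advantages < 0).sum())
    if n_neg == 0:
        return 1.0
    return min(1.0, n_pos / n_neg)


def hysteretic_token_weights(advantages, lengths, beta):
    """Per-token coefficient applied to each response's tokens.

    Negative advantages are scaled by beta (beta = 1 keeps the update
    symmetric, beta = 0 keeps only positive updates). Every response is
    divided by the mean group length instead of its own length.
    """
    advantages = np.asarray(advantages, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    scaled = np.where(advantages < 0, beta * advantages, advantages)
    return scaled / lengths.mean()
```

With one correct response out of four (rewards `[1, 0, 0, 0]`), this assumed rule gives a weight of 1/3, so the three negative-advantage responses together contribute about as much as the single positive one. Because all responses share the same denominator, a response's total contribution scales with its advantage and token count rather than being equalized by its own length.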