HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

📅 2026-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the instability and inefficiency of GRPO-style algorithms under sparse verifiable rewards, where early training updates are dominated by negative-advantage samples and per-response length normalization ties the update magnitude to the output length. To mitigate these issues, the authors propose Hysteretic Policy Optimization (HPO), which downweights negative-advantage updates and replaces per-response length normalization with mean-length normalization to improve both stability and sample efficiency. They further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight from the sign distribution of advantages within each batch, removing the need to tune a fixed weight by hand. On the TeleLogs and Countdown tasks, A-HPO outperforms GRPO, SAPO, and GSPO: it reaches a final reward of 0.84 on TeleLogs (5% above SAPO, 11% above GSPO, and 15% above GRPO) and delivers its largest gains early in training across multiple model scales.
📝 Abstract
We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse-reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response length. On Countdown, A-HPO achieves the largest gains in the initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
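The two changes described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact loss or the adaptive rule, so the function names (`group_advantages`, `adaptive_beta`, `hysteretic_weights`) and, in particular, the rule "beta = positives / negatives, clipped to [0, 1]" are assumptions chosen only to show the idea. The sketch computes GRPO-style group-relative advantages, scales negative advantages by a hysteretic weight `beta`, and divides by the mean response length of the group instead of each response's own length.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style group-relative advantages: (r - mean) / std over one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All responses scored the same, so the group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def adaptive_beta(advantages):
    """Hypothetical adaptive rule (an assumption, not taken from the paper):
    ratio of positive to negative advantage counts, clipped to [0, 1]."""
    n_pos = sum(1 for a in advantages if a > 0)
    n_neg = sum(1 for a in advantages if a < 0)
    if n_neg == 0:
        return 1.0
    return min(1.0, n_pos / n_neg)


def hysteretic_weights(advantages, lengths, beta):
    """Per-token weight for each response: negative advantages are scaled
    by beta, and every response is normalized by the mean group length."""
    mean_len = mean(lengths)
    weights = []
    for a in advantages:
        scale = 1.0 if a >= 0 else beta
        weights.append(scale * a / mean_len)
    return weights


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]   # sparse reward: one success in four
    lengths = [10, 20, 30, 20]       # response lengths in tokens
    adv = group_advantages(rewards)
    beta = adaptive_beta(adv)
    print(beta, hysteretic_weights(adv, lengths, beta))
```

With `beta = 1` and each response's own length in place of `mean_len`, the weights reduce to the symmetric, per-response-normalized form the abstract attributes to GRPO; with `beta = 0` only positive-advantage responses contribute, which matches the positive-only end of the ablation described above.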
Problem

Research questions and friction points this paper is trying to address.

sparse reward
reinforcement learning
policy optimization
training stability
advantage estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hysteretic Policy Optimization
Sparse Reward
Adaptive Weighting
Length Normalization
Reinforcement Learning