🤖 AI Summary
In sparse-reward environments, existing population-based policy optimization methods are prone to issues such as advantage collapse, high-variance gradients, or distributional bias introduced by mixed policies. This work proposes the Hindsight-Anchored Policy Optimization (HAPO) framework, which incorporates a Synthetic Success Injection (SSI) mechanism to selectively leverage teacher demonstrations upon policy failure and employs a Thompson-sampling-inspired gating mechanism to construct a self-paced curriculum. The approach achieves asymptotic consistency by automatically annealing teacher signals as the agent's performance improves, ensuring that off-policy guidance serves only as a temporary scaffold rather than a performance ceiling. Experimental results demonstrate that HAPO effectively mitigates training instability, recovers unbiased policy gradients, and enables agents to surpass the performance of the provided teacher demonstrations.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson-sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves *asymptotic consistency*: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
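To make the gating idea concrete, here is a minimal sketch of a Thompson-sampling-style gate for success injection. The abstract does not specify the paper's actual algorithm or API, so the class name `SSIGate`, its methods, and the Beta-posterior formulation are all illustrative assumptions: the gate tracks a Beta posterior over the policy's success rate, samples from it on each failed rollout group, and injects a teacher demonstration with probability equal to the sampled failure rate. Injections are frequent early in training and vanish as real successes accumulate, mirroring the annealing behavior described above.

```python
import random

class SSIGate:
    """Hypothetical sketch of a Thompson-sampling-inspired injection gate.

    Not the paper's implementation: names and structure are assumptions.
    A Beta(alpha, beta) posterior tracks the policy's task success rate;
    as genuine successes accumulate, the posterior concentrates near 1
    and teacher-demonstration injections anneal away automatically.
    """

    def __init__(self, prior_success=1.0, prior_failure=1.0):
        # Uniform Beta(1, 1) prior over the success rate by default.
        self.alpha = prior_success
        self.beta = prior_failure

    def update(self, n_success, n_failure):
        # Fold observed rollout outcomes into the posterior pseudo-counts.
        self.alpha += n_success
        self.beta += n_failure

    def should_inject(self, group_has_success, rng=random):
        # Never inject when the group already contains a genuine success:
        # the group-relative advantage is informative on its own.
        if group_has_success:
            return False
        # Thompson step: sample a plausible success rate from the posterior.
        sampled_rate = rng.betavariate(self.alpha, self.beta)
        # Inject a teacher demo with probability equal to the sampled
        # failure rate -- high for weak policies, vanishing as they improve.
        return rng.random() < (1.0 - sampled_rate)
```

Under this sketch, a freshly initialized gate facing repeated failures injects demonstrations almost every time, while a gate whose posterior reflects many successes almost never does, so the teacher signal fades without any hand-tuned schedule.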