🤖 AI Summary
This work addresses two limitations of existing policy optimization methods. Proximal Policy Optimization (PPO) has low sample efficiency because hard clipping discards gradient information, while unclipped methods such as Simple Policy Optimization (SPO) are unstable because their gradients are unbounded. To overcome these issues, the authors propose Anchored Neighborhood Optimization (ANO), derived within a unified trust-region framework. ANO follows a "Redescending Influence Principle" that dynamically suppresses the impact of outliers on policy updates, replacing both monotonic penalties and hard thresholds. The theoretical analysis shows that this principle is necessary for stability in high-variance stochastic optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO, and it remains stable and avoids policy collapse under aggressive hyperparameters, for example learning rates three times larger than standard values.
📝 Abstract
Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gradients, causing significant instability and hyperparameter sensitivity. To resolve this, we establish a Unified Trust Region Framework that generalizes existing objectives. Within this framework, we derive Anchored Neighborhood Optimization (ANO) from a set of design principles. We trace the failure of standard policy gradients to a misapplication of gradient influence on outliers. We propose the Redescending Influence Principle, a paradigm shift from monotonic penalties (SPO) and hard thresholding (PPO) to dynamic outlier suppression, and prove its necessity for stability in high-variance stochastic optimization. Theoretically, we prove that ANO possesses the minimal structural complexity required for robust optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO. Notably, ANO demonstrates superior stability, preventing policy collapse even under aggressive hyperparameters (e.g., learning rates 3× larger than standard) where PPO fails completely.
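To make the contrast between hard thresholding, monotonic penalties, and redescending influence concrete, the sketch below compares three per-sample gradient shapes as the probability ratio moves away from 1. It is an illustration under stated assumptions, not the paper's method. The abstract does not give ANO's objective, so the redescending curve uses the Welsch influence function from robust statistics purely to show the shape. The quadratic penalty is an assumed stand-in for a monotonic penalty of the kind attributed to SPO. The function names and the defaults `eps=0.2` and `scale=0.2` are hypothetical choices for this example.

```python
import math


def ppo_clip_grad(ratio: float, eps: float = 0.2) -> float:
    """Gradient w.r.t. ratio of min(ratio*A, clip(ratio, 1-eps, 1+eps)*A) for A = 1.

    Hard threshold: the sample contributes fully until ratio reaches 1 + eps,
    then its gradient is dropped entirely.
    """
    return 1.0 if ratio < 1.0 + eps else 0.0


def quad_penalty_grad(ratio: float, eps: float = 0.2) -> float:
    """Gradient w.r.t. ratio of ratio*A - |A|/(2*eps) * (ratio - 1)**2 for A = 1.

    Assumed monotonic-penalty form: the pull back toward ratio = 1 grows
    linearly with the deviation, so its magnitude is unbounded.
    """
    return 1.0 - (ratio - 1.0) / eps


def welsch_influence(dev: float, scale: float = 0.2) -> float:
    """Welsch influence function psi(d) = d * exp(-(d/scale)**2 / 2).

    A textbook redescending shape (NOT the ANO objective): influence rises
    for small deviations, peaks at |d| = scale, and decays smoothly to zero
    for outliers.
    """
    return dev * math.exp(-0.5 * (dev / scale) ** 2)


if __name__ == "__main__":
    # Tabulate the three shapes for increasingly extreme ratios.
    for ratio in (1.0, 1.1, 1.5, 3.0):
        dev = ratio - 1.0
        print(
            f"ratio={ratio:.1f}  clip={ppo_clip_grad(ratio):.1f}  "
            f"quad={quad_penalty_grad(ratio):+.2f}  "
            f"welsch={welsch_influence(dev):.4f}"
        )
```

Read together, the clipped gradient switches off abruptly, the quadratic-penalty gradient keeps growing in magnitude, and the redescending curve fades outliers out smoothly without a hard cutoff.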