🤖 AI Summary
Proximal Policy Optimization (PPO) frequently suffers from directional bias in importance ratios: ratios drift downward under positive advantages and upward under negative advantages, undermining optimization stability and performance, yet this phenomenon has lacked systematic investigation. Method: We propose Directional-Clamp PPO, which explicitly identifies and suppresses such "wrong"-direction deviation. It introduces a direction-aware clamping mechanism with a tunable threshold β and steeper loss slopes that penalize only updates moving in the "wrong" direction, while leaving the objective near ratios of 1 unchanged. The method integrates seamlessly into the standard PPO framework and remains compatible with importance sampling and advantage estimation. Contribution/Results: On multi-task MuJoCo benchmarks, Directional-Clamp PPO consistently outperforms vanilla PPO and leading variants across random seeds. Theoretical and empirical analyses show that it better avoids harmful "wrong"-direction policy updates, thereby enhancing the reliability of policy optimization.
📝 Abstract
Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move in the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move in the "wrong" direction during PPO optimization. This is a key factor hindering the improvement of PPO, yet it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes actions entering the strict "wrong" direction regions, where the advantage is positive (negative) and the importance ratio falls below (above) $1 - \beta$ ($1 + \beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is enforced via a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants that focus on modifying the objective's behavior in the "right" direction regions, across various MuJoCo environments and random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
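To make the clamp idea concrete, here is a minimal per-sample sketch in plain Python. The abstract does not give the exact loss form, so the `slope` multiplier (the steeper gradient in the strict "wrong" direction regions) and the linear shape of the extra penalty are assumptions for illustration; only the region definitions (advantage positive and ratio below $1-\beta$, or advantage negative and ratio above $1+\beta$) come from the text.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def dclamp_ppo_term(ratio, advantage, eps=0.2, beta=0.1, slope=3.0):
    """PPO surrogate with an extra steep penalty (a "clamp") applied only in
    the strict "wrong" direction regions:
      advantage > 0 and ratio < 1 - beta, or
      advantage < 0 and ratio > 1 + beta.
    `slope` is a hypothetical steepness constant, not from the paper."""
    obj = ppo_clip_term(ratio, advantage, eps)
    if advantage > 0 and ratio < 1.0 - beta:
        # Ratio drifted down although the advantage is positive:
        # subtract a steep linear penalty growing with the deviation.
        obj -= slope * abs(advantage) * ((1.0 - beta) - ratio)
    elif advantage < 0 and ratio > 1.0 + beta:
        # Ratio drifted up although the advantage is negative.
        obj -= slope * abs(advantage) * (ratio - (1.0 + beta))
    return obj
```

Outside the wrong-direction regions the objective coincides with standard PPO, so near a ratio of 1 nothing changes; inside them, the added linear term makes the loss surface steeper, so gradient ascent pushes the ratio back toward 1.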