🤖 AI Summary
This work addresses premature convergence in deep reinforcement learning, where early entropy collapse causes agents to abandon exploration before discovering globally optimal policies. To mitigate this, the authors propose Optimistic Policy Regularization (OPR), a lightweight mechanism that, for the first time, directly incorporates high-performing historical trajectories into policy updates. OPR dynamically maintains a buffer of successful trajectories and combines directional log-ratio reward shaping with an auxiliary behavioral cloning loss within the PPO framework. Evaluated on 49 Atari games, it surpasses the standard 50-million-step baselines on 22 environments using only 10 million training steps, and it also outperforms the champion solution in the CAGE Challenge 2 network-defense task, demonstrating substantially improved sample efficiency and final performance.
📝 Abstract
Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. When instantiated on Proximal Policy Optimization (PPO), OPR substantially improves sample efficiency on the Arcade Learning Environment. Across 49 Atari games evaluated at the 10-million-step benchmark, OPR achieves the highest score in 22 environments despite baseline methods being reported at the standard 50-million-step horizon. Beyond arcade benchmarks, OPR also generalizes to the CAGE Challenge 2 cyber-defense environment, surpassing the competition-winning Cardiff agent while using the same PPO architecture. These results demonstrate that anchoring policy updates to empirically successful trajectories can improve both sample efficiency and final performance.
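The abstract does not specify OPR's buffer-retention rule, the exact shaping formula, or the loss weighting, so the following is a minimal, hypothetical Python sketch of the described components. It assumes the buffer keeps the top-k episodes by return, the directional shaping term adds α·(log π_new − log π_old) on transitions from buffered trajectories, and the behavioral cloning term is the mean negative log-likelihood of buffered actions; the paper's actual formulation may differ.

```python
import heapq
import random


class SuccessBuffer:
    """Dynamic buffer of high-return episodes (hypothetical sketch of OPR's
    trajectory buffer; the paper's exact retention rule is not given here)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._heap = []      # min-heap of (return, insertion_id, episode)
        self._next_id = 0    # tie-breaker so episodes are never compared

    def add(self, episode_return, episode):
        """Store an episode; once full, keep it only if it beats the worst
        stored return."""
        item = (episode_return, self._next_id, episode)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif episode_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def sample(self, k=1):
        """Uniformly sample up to k stored episodes for the BC term."""
        episodes = [ep for _, _, ep in self._heap]
        return random.sample(episodes, min(k, len(episodes)))


def shaped_reward(r, logp_new, logp_old, on_success_traj, alpha=0.1):
    """Directional log-ratio shaping (assumed form): add a bonus when the
    current policy makes a transition more likely than the old policy did,
    applied only along buffered high-return trajectories."""
    if on_success_traj:
        return r + alpha * (logp_new - logp_old)
    return r


def bc_loss(logps_of_buffered_actions):
    """Auxiliary behavioral cloning: mean negative log-probability the
    current policy assigns to actions from successful episodes."""
    return -sum(logps_of_buffered_actions) / len(logps_of_buffered_actions)
```

In a PPO update, the total objective would then be the clipped surrogate loss plus a weighted `bc_loss` term, with `shaped_reward` applied before advantage estimation; the shaping coefficient α and the BC weight are hyperparameters not stated in the abstract.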