🤖 AI Summary
Existing policy gradient methods (e.g., PPO) suffer from performance saturation under large-scale parallelization, while evolutionary reinforcement learning (EvoRL) exhibits poor sample efficiency. To address this dual bottleneck, the authors propose Evolutionary Policy Optimization (EPO), an algorithm that integrates evolutionary search directly into the policy-gradient update. Built on PPO, EPO introduces a stochastic perturbation-based population-generation mechanism, elite preservation, and a hybrid gradient-evolutionary update rule, allowing population diversity to evolve along gradient-guided directions. This design combines the sample efficiency of policy gradients with the exploration robustness of evolutionary methods, and it supports GPU-accelerated parallel simulation. On standard benchmarks, including MuJoCo and ProcGen, EPO achieves an average 37% performance improvement over PPO under 128-environment parallelism, demonstrates superior scalability, and overcomes the traditional sample-efficiency limitations of EvoRL.
📝 Abstract
Despite its extreme sample inefficiency, on-policy reinforcement learning has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as PPO, fail to fully leverage the benefits of parallelized environments, leading to performance saturation beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. Yet existing EvoRL methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EAs and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments, demonstrating superior scalability with parallelized simulations.
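The hybrid loop described in the summary — perturb a base policy into a population, preserve the elites, then apply a gradient step — can be sketched as follows. This is a toy illustration, not the paper's implementation: `evaluate` is a stand-in for episodic return from a parallel simulator, `gradient_step` stands in for a PPO update, and the hyperparameters (`pop_size`, `sigma`, `n_elite`) are assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.ones(8)  # hypothetical optimum standing in for high-return behavior

def evaluate(theta):
    # Toy fitness: negative squared distance to TARGET
    # (a proxy for the episodic return a simulator would report).
    return -np.sum((theta - TARGET) ** 2)

def gradient_step(theta, lr=0.1):
    # Stand-in for a PPO policy-gradient update; here we use the
    # analytic gradient of the toy fitness above.
    grad = -2.0 * (theta - TARGET)
    return theta + lr * grad

def epo_step(theta, pop_size=16, sigma=0.05, n_elite=4):
    # 1) Population generation: stochastic perturbations of the base policy.
    population = [theta + sigma * rng.standard_normal(theta.shape)
                  for _ in range(pop_size)]
    # 2) Elite preservation: keep only the best-scoring perturbations.
    scores = np.array([evaluate(p) for p in population])
    elites = [population[i] for i in np.argsort(scores)[-n_elite:]]
    # 3) Hybrid update: aggregate the elites, then take a gradient step,
    #    so population diversity evolves along a gradient-guided direction.
    return gradient_step(np.mean(elites, axis=0))

theta = np.zeros(8)
for _ in range(50):
    theta = epo_step(theta)
```

In this sketch the evolutionary perturbations supply exploration around the current policy while the gradient step keeps updates sample-efficient; in the actual method each stage would run over GPU-parallelized environments rather than a scalar toy fitness.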