Evolutionary Policy Optimization

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing policy gradient methods such as PPO suffer from performance saturation under large-scale parallelization, while evolutionary reinforcement learning (EvoRL) exhibits poor sample efficiency. To address this dual bottleneck, we propose Evolutionary Policy Optimization (EPO), the first algorithm that intrinsically integrates evolutionary search into the policy gradient update framework. Built on PPO, EPO introduces a stochastic perturbation-based population generation mechanism, elite preservation, and a hybrid gradient-evolutionary update rule, enabling population diversity to evolve in gradient-guided directions. This design combines the sample efficiency of policy gradients with the exploration robustness of evolutionary methods and supports GPU-accelerated parallel simulation. On standard benchmarks, including MuJoCo and ProcGen, EPO achieves an average 37% performance improvement over PPO under 128-environment parallelism, demonstrates superior scalability, and overcomes the traditional sample-efficiency limitations of EvoRL.
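The mechanism described above (perturbation-based population generation, elite preservation, and a hybrid gradient-evolutionary update) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy `returns` objective and its analytic `grad_returns` stand in for environment rollouts and a PPO-style policy-gradient estimate, and all hyperparameter names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def returns(theta):
    # Toy stand-in for the average episode return of a policy with
    # parameters theta (maximized at theta == 1).
    return -np.sum((theta - 1.0) ** 2)

def grad_returns(theta):
    # Analytic gradient of the toy objective; in the real algorithm this
    # would be a PPO-style policy-gradient estimate from rollouts.
    return -2.0 * (theta - 1.0)

def epo_sketch(theta, iters=200, pop_size=16, n_elite=4, sigma=0.1, lr=0.05):
    for _ in range(iters):
        # Population generation: stochastic perturbations of the current policy.
        population = theta + sigma * rng.standard_normal((pop_size, theta.size))
        scores = np.array([returns(p) for p in population])
        # Elite preservation: keep the best-performing perturbations.
        elites = population[np.argsort(scores)[-n_elite:]]
        # Hybrid update: recenter on the elite mean, then take a gradient step,
        # so the population evolves in a gradient-guided direction.
        theta = elites.mean(axis=0)
        theta = theta + lr * grad_returns(theta)
    return theta

theta = epo_sketch(np.zeros(3))
```

In a parallelized setting, the population evaluation loop is where GPU-accelerated simulation pays off: each perturbed policy can be rolled out in its own batch of environments simultaneously.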

📝 Abstract
Despite its extreme sample inefficiency, on-policy reinforcement learning has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as PPO, fail to fully leverage the benefits of parallelized environments, leading to performance saturation beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. However, existing EvoRL methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EA and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments, demonstrating superior scalability with parallelized simulations.
Problem

Research questions and friction points this paper is trying to address.

On-policy RL fails to utilize parallelized environments effectively
Evolutionary Algorithms lack sample efficiency despite enhancing diversity
Can combining EAs with policy gradients improve RL scalability?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines Evolutionary Algorithms with policy gradients
Improves performance in diverse challenging environments
Enhances scalability with parallelized simulations
Jianren Wang
Carnegie Mellon University | Skild AI
Artificial Intelligence | Neuroscience | Biology
Yifan Su
Robotics Institute, Carnegie Mellon University, PA 15213, USA
Abhinav Gupta
Robotics Institute, Carnegie Mellon University, PA 15213, USA
Deepak Pathak
Robotics Institute, Carnegie Mellon University, PA 15213, USA