🤖 AI Summary
Existing policy gradient methods (e.g., PPO) suffer from performance saturation under large-scale parallelization, while evolutionary reinforcement learning (EvoRL) exhibits poor sample efficiency. To address this dual bottleneck, the authors propose Evolutionary Policy Optimization (EPO), an algorithm that integrates evolutionary search directly into the policy-gradient update. Built on PPO, EPO introduces a stochastic perturbation-based population-generation mechanism, elite preservation, and a hybrid gradient-evolutionary update rule, allowing population diversity to evolve along gradient-guided directions. This design combines the sample efficiency of policy gradients with the exploration robustness of evolutionary methods, and it supports GPU-accelerated parallel simulation. On standard benchmarks, including MuJoCo and ProcGen, EPO achieves an average 37% performance improvement over PPO under 128-environment parallelism, demonstrates superior scalability, and overcomes the traditional sample-efficiency limitations of EvoRL.
📝 Abstract
Despite its extreme sample inefficiency, on-policy reinforcement learning has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as PPO, fail to fully leverage the benefits of parallelized environments, leading to performance saturation beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. Yet existing EvoRL methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EAs and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments, demonstrating superior scalability with parallelized simulations.
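The hybrid loop described in the summary — perturb a base policy into a population, preserve the elites, then apply a gradient step — can be sketched as follows. This is a toy illustration, not the paper's implementation: `evaluate` is a stand-in for episodic return from a parallel simulator, `gradient_step` stands in for a PPO update, and the hyperparameters (`pop_size`, `sigma`, `n_elite`) are assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.ones(8)  # hypothetical optimum standing in for high-return behavior

def evaluate(theta):
    # Toy fitness: negative squared distance to TARGET
    # (a proxy for the episodic return a simulator would report).
    return -np.sum((theta - TARGET) ** 2)

def gradient_step(theta, lr=0.1):
    # Stand-in for a PPO policy-gradient update; here we use the
    # analytic gradient of the toy fitness above.
    grad = -2.0 * (theta - TARGET)
    return theta + lr * grad

def epo_step(theta, pop_size=16, sigma=0.05, n_elite=4):
    # 1) Population generation: stochastic perturbations of the base policy.
    population = [theta + sigma * rng.standard_normal(theta.shape)
                  for _ in range(pop_size)]
    # 2) Elite preservation: keep only the best-scoring perturbations.
    scores = np.array([evaluate(p) for p in population])
    elites = [population[i] for i in np.argsort(scores)[-n_elite:]]
    # 3) Hybrid update: aggregate the elites, then take a gradient step,
    #    so population diversity evolves along a gradient-guided direction.
    return gradient_step(np.mean(elites, axis=0))

theta = np.zeros(8)
for _ in range(50):
    theta = epo_step(theta)
```

In this sketch the evolutionary perturbations supply exploration around the current policy while the gradient step keeps updates sample-efficient; in the actual method each stage would run over GPU-parallelized environments rather than a scalar toy fitness.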