Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited exploration capacity of single-policy learning in large-scale reinforcement learning, as well as the instability and inefficiency that existing ensemble methods often incur through excessive exploration. Through theoretical analysis, the study shows that inter-policy diversity plays a critical role in learning efficiency, and it proposes Coupled Policy Optimization: by introducing a KL-divergence constraint within an ensemble policy gradient framework, the method explicitly regulates inter-policy diversity to enable efficient and stable structured exploration. Evaluated in large-scale parallel reinforcement learning systems, the proposed approach significantly outperforms baselines such as SAPG, PBT, and PPO on complex tasks including dexterous manipulation, achieving substantial improvements in both sample efficiency and final performance.

📝 Abstract
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/ .
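The abstract describes regulating inter-policy diversity with KL constraints between a leader and its follower policies. The paper's exact constraint form is not given here, so the following is only a minimal sketch under stated assumptions: diagonal-Gaussian policies, and a hinge penalty (with hypothetical coefficient `beta` and diversity target `d_target`) that activates when a follower's KL divergence from the leader exceeds the target.

```python
import numpy as np

def kl_diag_gaussian(mu_p, sig_p, mu_q, sig_q):
    """KL(p || q) between diagonal Gaussian policies, summed over action dims."""
    return float(np.sum(
        np.log(sig_q / sig_p)
        + (sig_p**2 + (mu_p - mu_q)**2) / (2.0 * sig_q**2)
        - 0.5
    ))

def coupled_objective(pg_losses, leader, followers, d_target, beta):
    """Illustrative ensemble loss (assumed form, not the paper's exact one):
    the leader keeps its plain policy-gradient loss; each follower adds a
    hinge penalty once its KL from the leader exceeds d_target."""
    total = pg_losses[0]  # leader trained with its own PG loss only
    for loss, (mu, sig) in zip(pg_losses[1:], followers):
        kl = kl_diag_gaussian(mu, sig, *leader)
        total += loss + beta * max(0.0, kl - d_target)
    return total
```

With identical leader and follower distributions the KL term is zero, so the objective reduces to the sum of the individual policy-gradient losses; the penalty only engages as followers drift beyond the diversity target.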
Problem

Research questions and friction points this paper is trying to address.

Policy Diversity
Ensemble Policy Gradient
Large-Scale Reinforcement Learning
Exploration Efficiency
Training Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble Policy Gradient
Policy Diversity
KL Constraint
Coupled Policy Optimization
Sample Efficiency
Naoki Shitanda
The University of Tokyo, Tokyo, Japan; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Motoki Omura
The University of Tokyo, Tokyo, Japan
Tatsuya Harada
The University of Tokyo
Computer Vision, Machine Learning, Intelligent Robot
Takayuki Osa
Team Director, RIKEN Center for Advanced Intelligence Project
Robot Learning, Imitation Learning, Reinforcement Learning, Robotics, Machine Learning