Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance stagnation commonly observed in Proximal Policy Optimization (PPO) during training, which arises from a mismatch between gradient noise in policy updates and the effective step size. For the first time, the problem is formally modeled through the lens of outer-loop stochastic optimization, revealing that stagnation occurs when the effective step size becomes excessively large relative to the magnitude of gradient noise. To mitigate this, the authors propose reducing gradient noise via large-scale parallel environment sampling and concurrently scaling down learning rates and other hyperparameters to appropriately shrink the effective step size. The resulting approach achieves stable and monotonic performance improvements across complex open-domain tasks, scaling up to one trillion environment transitions and significantly outperforming existing baselines.
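The co-scaling idea in the summary (shrink the learning rate as parallelism grows, so the effective step size stays small relative to the reduced gradient noise) can be sketched as a toy rule. This is an illustrative assumption, not the paper's actual recipe: it supposes noise shrinks as 1/sqrt(n_envs) and shrinks the learning rate by the same factor, with batch size scaled linearly. The function name and parameters are hypothetical.

```python
import math

def co_scaled_hparams(base_lr, base_envs, new_envs, base_batch):
    """Hypothetical co-scaling rule (illustration only, not the paper's recipe).

    Assumption: averaging gradients over k-times more parallel environments
    reduces gradient noise by roughly sqrt(k), so the learning rate is shrunk
    by sqrt(k) to keep the step-size-to-noise ratio from growing, while the
    per-update batch grows linearly with the environment count.
    """
    k = new_envs / base_envs
    return {"lr": base_lr / math.sqrt(k), "batch": int(base_batch * k)}

# Example: scaling from 1,024 to ~1M parallel environments.
print(co_scaled_hparams(3e-4, 1024, 1_048_576, 64))
```

The point of the sketch is only the direction of the adjustment: more parallel environments means lower noise, which in turn permits (and, per the summary, requires) a smaller effective step size to avoid stagnation.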

📝 Abstract
Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
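The abstract's outer-loop model predicts a plateau whose level is set by the step size relative to the gradient noise, with noise shrinking as more parallel environments are averaged. A toy stochastic-optimization sketch (not the paper's method, and PPO-free: plain fixed-step SGD on a noisy quadratic) illustrates both predictions; all names and constants here are made up for illustration.

```python
import random

def noisy_grad(x, sigma, n_envs, rng):
    # Average n_envs independent noisy estimates of the gradient of
    # f(x) = x^2 / 2; noise scale shrinks roughly as sigma / sqrt(n_envs).
    return sum(x + rng.gauss(0.0, sigma) for _ in range(n_envs)) / n_envs

def plateau_level(step_size, sigma, n_envs, steps=2000, tail=500, seed=0):
    """Run fixed-step SGD and estimate the level it plateaus at."""
    rng = random.Random(seed)
    x = 5.0
    for _ in range(steps):  # burn-in to reach the stationary regime
        x -= step_size * noisy_grad(x, sigma, n_envs, rng)
    dist = []
    for _ in range(tail):   # average |x| over a tail window
        x -= step_size * noisy_grad(x, sigma, n_envs, rng)
        dist.append(abs(x))
    return sum(dist) / len(dist)

few = plateau_level(step_size=0.5, sigma=2.0, n_envs=1)
many = plateau_level(step_size=0.5, sigma=2.0, n_envs=100)
print(few, many)  # same step size, but far more "environments" -> lower plateau
```

With the step size held fixed, the run with 100-fold averaging settles much closer to the optimum than the single-sample run, mirroring the abstract's claim that increasing the number of parallel environments is a simple way to shrink the noise-driven plateau.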
Problem

Research questions and friction points this paper is trying to address.

learning stagnation
plateau
PPO
on-policy RL
sample-based estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

PPO
learning stagnation
parallel environments
stochastic optimization
sample efficiency