Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation that reinforcement learning typically suffers at large batch sizes. The authors trace the problem to policy non-stationarity, which makes it hard to balance early-stage plasticity against late-stage convergence stability. They propose Adaptive Batch Scaling (ABS), a method that quantifies policy non-stationarity with a behavioral divergence metric and adjusts the effective batch size inversely to it: smaller batches early in training to preserve plasticity, larger batches later to enable precise convergence. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the Arcade Learning Environment (ALE) benchmark, ABS makes the combination of larger neural networks and larger batch sizes the best-performing configuration, a scaling behavior the authors describe as previously thought unattainable in deep reinforcement learning.

📝 Abstract

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing the batch size typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), a method that dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale the batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS reconciles early-stage plasticity with late-stage stable convergence. Contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL and now unlocked through adaptive batch control.
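
The abstract does not give the exact formula for Behavioral Divergence or the rule that maps it to a batch size, so the following is only a minimal sketch of the idea under stated assumptions. It assumes divergence can be approximated as the fraction of a fixed set of probe states whose greedy action changes between two consecutive updates, and that the batch size is interpolated log-linearly between two bounds. The function names, the probe-state setup, and the `min_batch` and `max_batch` defaults are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def behavioral_divergence(prev_actions: Sequence[int],
                          new_actions: Sequence[int]) -> float:
    """Fraction of probe states whose greedy action changed after an update.

    Assumption: this action-change rate stands in for the paper's
    Behavioral Divergence metric, whose exact definition is not given here.
    """
    if len(prev_actions) != len(new_actions):
        raise ValueError("action lists must cover the same probe states")
    if not prev_actions:
        raise ValueError("need at least one probe state")
    changed = sum(a != b for a, b in zip(prev_actions, new_actions))
    return changed / len(prev_actions)


def effective_batch_size(divergence: float,
                         min_batch: int = 256,
                         max_batch: int = 8192) -> int:
    """Map divergence in [0, 1] to a batch size that shrinks as divergence grows.

    Assumption: log-linear interpolation between hypothetical bounds.
    divergence = 1 (volatile policy) gives min_batch;
    divergence = 0 (stable policy) gives max_batch.
    """
    d = min(max(divergence, 0.0), 1.0)  # clamp to the valid range
    ratio = max_batch / min_batch
    return int(round(min_batch * ratio ** (1.0 - d)))


if __name__ == "__main__":
    # Greedy actions on four probe states before and after one update.
    before = [0, 1, 2, 3]
    after = [1, 1, 2, 0]
    d = behavioral_divergence(before, after)  # two of four actions changed
    print(d, effective_batch_size(d))
```

In a training loop, the returned size would typically be realized by changing how many samples contribute to each gradient step, for example through gradient accumulation. How ABS actually applies the effective batch size inside PQN is not specified in this summary.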

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Batch Size

Non-stationarity

Scalability

On-Policy Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Batch Scaling

Behavioral Divergence

On-Policy Reinforcement Learning