🤖 AI Summary
This work addresses the inefficiency of Proximal Policy Optimization (PPO) under path-dependent noise during multi-epoch updates, which leads to signal saturation and the accumulation of ineffective policy updates. To overcome this limitation, the authors propose a "wide rather than deep" optimization paradigm: multiple PPO instances, differing only in mini-batch ordering, are run in parallel on the same batch, and their policies are aggregated via a logarithmic opinion pool in the natural parameter space, guided by Fisher information geometry. The consensus update is further regularized with a KL constraint to maintain trust-region compliance. Notably, the method requires no additional environment interactions and achieves up to an 8.6× improvement over standard PPO under a fixed sample budget on continuous control tasks, significantly outperforming both vanilla PPO and compute-matched deep-optimization baselines.
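For exponential-family policies, the logarithmic opinion pool mentioned above reduces to averaging natural parameters. The sketch below illustrates this for the diagonal-Gaussian policies typical of continuous control; the function name, equal-weight default, and per-dimension treatment are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def log_opinion_pool_gaussian(mus, sigmas, weights=None):
    """Aggregate K Gaussian policies N(mu_k, sigma_k^2) via a logarithmic
    opinion pool: a weighted average in natural parameter space
    (eta1 = mu / sigma^2, eta2 = -1 / (2 sigma^2)), which for an
    exponential family yields another Gaussian.  Illustrative sketch."""
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    if weights is None:
        weights = np.full(len(mus), 1.0 / len(mus))  # equal expert weights
    eta1 = weights @ (mus / sigmas**2)    # pooled first natural parameter
    eta2 = weights @ (-0.5 / sigmas**2)   # pooled second natural parameter
    sigma2 = -0.5 / eta2                  # recover consensus variance
    mu = eta1 * sigma2                    # recover consensus mean
    return mu, np.sqrt(sigma2)
```

With equal weights, pooling N(0, 1) and N(2, 1) gives the consensus N(1, 1): precisions average to 1, and the precision-weighted means average to 1.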
📝 Abstract
Proximal policy optimization (PPO) approximates the trust-region update with multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual, which consumes trust-region budget without first-order surrogate improvement). Empirically, signal saturates while waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicas are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves a higher KL-penalized surrogate and tighter trust-region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6×. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.
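The signal/waste split in the abstract can be sketched concretely: project an update onto the natural gradient direction under the Fisher metric, and call the Fisher-orthogonal remainder waste. This is a hypothetical minimal sketch of that geometric decomposition (function name and exact conventions are assumptions, not the paper's definitions):

```python
import numpy as np

def signal_waste_decomposition(delta, g, F):
    """Split a parameter update `delta` into a 'signal' component along the
    natural gradient F^{-1} g and a Fisher-orthogonal 'waste' residual,
    using the Fisher inner product <a, b>_F = a^T F b.  Illustrative sketch;
    F must be positive definite."""
    d = np.linalg.solve(F, g)              # natural gradient direction
    c = (delta @ F @ d) / (d @ F @ d)      # Fisher projection coefficient
    signal = c * d                         # component along natural gradient
    waste = delta - signal                 # Fisher-orthogonal residual
    return signal, waste
```

By construction the waste term satisfies $\langle \text{waste}, d\rangle_F = 0$, so it consumes trust-region budget (Fisher norm) without contributing to the first-order surrogate improvement measured along the natural gradient.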