🤖 AI Summary
This work addresses the inefficiency of Proximal Policy Optimization (PPO) under path-dependent noise during multi-epoch updates, which leads to signal saturation and the accumulation of ineffective policy updates. To overcome this limitation, the authors propose a "wide rather than deep" optimization paradigm: multiple PPO instances, differing only in mini-batch ordering, are run in parallel on the same batch, and their policies are aggregated via a logarithmic opinion pool in the natural parameter space, guided by Fisher information geometry. The consensus update is further regularized with a KL constraint to maintain trust-region compliance. Notably, the method requires no additional environment interactions and achieves up to an 8.6× improvement over standard PPO under a fixed sample budget on continuous control tasks, significantly outperforming both vanilla PPO and compute-matched deep-optimization baselines.
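For exponential-family policies, the logarithmic opinion pool mentioned above reduces to averaging natural parameters. The sketch below illustrates this for the diagonal-Gaussian policies typical of continuous control; the function name, equal-weight default, and per-dimension treatment are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def log_opinion_pool_gaussian(mus, sigmas, weights=None):
    """Aggregate K Gaussian policies N(mu_k, sigma_k^2) via a logarithmic
    opinion pool: a weighted average in natural parameter space
    (eta1 = mu / sigma^2, eta2 = -1 / (2 sigma^2)), which for an
    exponential family yields another Gaussian.  Illustrative sketch."""
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    if weights is None:
        weights = np.full(len(mus), 1.0 / len(mus))  # equal expert weights
    eta1 = weights @ (mus / sigmas**2)    # pooled first natural parameter
    eta2 = weights @ (-0.5 / sigmas**2)   # pooled second natural parameter
    sigma2 = -0.5 / eta2                  # recover consensus variance
    mu = eta1 * sigma2                    # recover consensus mean
    return mu, np.sqrt(sigma2)
```

With equal weights, pooling N(0, 1) and N(2, 1) gives the consensus N(1, 1): precisions average to 1, and the precision-weighted means average to 1.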
📝 Abstract
Proximal policy optimization (PPO) approximates the trust-region update with multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual, which consumes trust-region budget without first-order surrogate improvement). Empirically, signal saturates while waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicas are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves a higher KL-penalized surrogate and tighter trust-region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6×. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.
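The signal/waste split in the abstract can be sketched concretely: project an update onto the natural gradient direction under the Fisher metric, and call the Fisher-orthogonal remainder waste. This is a hypothetical minimal sketch of that geometric decomposition (function name and exact conventions are assumptions, not the paper's definitions):

```python
import numpy as np

def signal_waste_decomposition(delta, g, F):
    """Split a parameter update `delta` into a 'signal' component along the
    natural gradient F^{-1} g and a Fisher-orthogonal 'waste' residual,
    using the Fisher inner product <a, b>_F = a^T F b.  Illustrative sketch;
    F must be positive definite."""
    d = np.linalg.solve(F, g)              # natural gradient direction
    c = (delta @ F @ d) / (d @ F @ d)      # Fisher projection coefficient
    signal = c * d                         # component along natural gradient
    waste = delta - signal                 # Fisher-orthogonal residual
    return signal, waste
```

By construction the waste term satisfies $\langle \text{waste}, d\rangle_F = 0$, so it consumes trust-region budget (Fisher norm) without contributing to the first-order surrogate improvement measured along the natural gradient.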