Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high variance in advantage estimates caused by non-stationary teammate policies in multi-agent reinforcement learning, which undermines ratio-based trust-region methods such as MAPPO and MASPO. The authors propose MARS, a policy optimization objective for the centralized training with decentralized execution (CTDE) framework that replaces additive ratio clipping (MAPPO) and soft quadratic penalties (MASPO) with a multiplicatively symmetric geometric barrier. The barrier assigns unbounded cost as probability ratios approach zero, which guards against probability collapse, while still passing corrective gradients for outlier samples that clipping would zero out. Across 47 tasks spanning eight benchmark environments, MARS matches or exceeds MAPPO and MASPO in aggregate environment-level performance, and ablations attribute the gains to the geometry of the symmetric barrier rather than to flexible trust-region boundaries alone.

📝 Abstract

Centralized training with decentralized execution (CTDE) is a standard framework for cooperative multi-agent policy-gradient reinforcement learning, allowing agents to learn from joint information while acting from local observations. Ratio-based trust-region methods such as Multi-Agent Proximal Policy Optimization (MAPPO) and Multi-Agent Simple Policy Optimization (MASPO) update decentralized actors using per-agent probability ratios weighted by joint advantage estimates. Teammate non-stationarity increases the variance of these advantages, which in turn increases the variance in the local ratio updates. This exposes two method-specific failure modes: MAPPO's additive clipping removes gradients for outlier samples and weakens recovery from policy drift, while MASPO's soft quadratic penalty can allow probability collapse. We introduce Multi-Agent Ratio Symmetry (MARS), a novel policy optimization objective that replaces these additive ratio-based trust-region mechanisms with a multiplicatively symmetric geometric barrier. MARS preserves corrective gradients while assigning unbounded cost as probability ratios approach zero. Across 47 tasks spanning eight multi-agent environments, including novel JAX benchmarks PaxMen and AeroJAX, MARS matches or exceeds MAPPO and MASPO in aggregate environment-level performance. Ablations show that these gains arise from the geometry of the symmetric barrier rather than from flexible trust-region boundaries alone.
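
The two failure modes described above can be illustrated with a small single-sample sketch. This is not the MARS objective, whose exact form is not given here: `log_barrier` is a hypothetical stand-in chosen only because its cost is symmetric under r -> 1/r and grows without bound as r -> 0. The quadratic form in `quadratic_penalty`, the shared width `EPS = 0.2`, and all function names are likewise assumptions made for illustration.

```python
import math

EPS = 0.2  # assumed trust-region width, shared by all three surrogates


def ppo_clip(r, adv, eps=EPS):
    """Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return min(r * adv, clipped * adv)


def quadratic_cost(r, eps=EPS):
    """Additive quadratic cost on the ratio (assumed SPO-style form)."""
    return (r - 1.0) ** 2 / (2.0 * eps)


def barrier_cost(r, eps=EPS):
    """Hypothetical symmetric cost: equal at r and 1/r, unbounded as r -> 0."""
    return math.log(r) ** 2 / (2.0 * eps)


def quadratic_penalty(r, adv, eps=EPS):
    """Soft-penalty surrogate: r * A - |A| * quadratic_cost(r)."""
    return r * adv - abs(adv) * quadratic_cost(r, eps)


def log_barrier(r, adv, eps=EPS):
    """Illustrative barrier surrogate: r * A - |A| * barrier_cost(r)."""
    return r * adv - abs(adv) * barrier_cost(r, eps)


def grad(f, r, adv, h=1e-6):
    """Central finite-difference derivative of f with respect to the ratio r."""
    return (f(r + h, adv) - f(r - h, adv)) / (2.0 * h)


if __name__ == "__main__":
    # Outlier sample: ratio well above 1 + eps with a positive advantage.
    for f in (ppo_clip, quadratic_penalty, log_barrier):
        print(f.__name__, "gradient at r=1.5:", grad(f, 1.5, 1.0))
    # Cost as the ratio collapses toward zero.
    for r in (1e-2, 1e-4, 1e-8):
        print(r, "quadratic:", quadratic_cost(r), "barrier:", barrier_cost(r))
```

In this sketch the clipped surrogate returns a zero gradient for the outlier sample, so nothing pulls the ratio back toward 1, whereas both penalty surrogates return a negative (corrective) gradient. As the ratio shrinks, the quadratic cost stays below 1 / (2 * EPS) while the logarithmic cost keeps growing, which is the qualitative distinction the abstract draws between a soft quadratic penalty and a barrier.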

Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning

trust region

policy optimization

non-stationarity

probability ratio

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reinforcement learning

trust region

policy optimization