🤖 AI Summary
In multi-agent reinforcement learning (MARL), independent policy gradient methods suffer from suboptimal convergence due to joint sampling error—i.e., the deviation of the empirical joint action distribution from the true joint policy distribution caused by independent action sampling over finite trajectories. This error stems from uncoordinated action selection across agents. To address it, we propose MA-PROPS: a method that maintains an adaptive centralized behavior policy which continually identifies and compensates for under-sampled joint actions, redistributing joint action probability to enable more robust on-policy sampling. Crucially, MA-PROPS does not rely on the centralized-training-with-decentralized-execution (CTDE) paradigm and supports fully decentralized deployment. Experiments on cooperative and no-conflict multi-agent games demonstrate that MA-PROPS significantly reduces joint sampling error, improves convergence to optimal joint policies, and enhances training stability. The approach provides a theoretically grounded and empirically efficient pathway to robustify independent policy gradient methods.
📝 Abstract
Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge suboptimally when each agent's policy gradient points toward a suboptimal equilibrium. In this work, we identify a subtler failure mode that arises *even when the expected policy gradients of all agents point toward an optimal solution.* After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This *sampling error* w.r.t. the joint on-policy distribution produces inaccurate gradient estimates that can lead agents to converge suboptimally. In this paper, we investigate whether joint sampling error can be reduced through coordinated action selection and whether doing so improves the reliability of policy gradient learning in MARL. Toward this end, we introduce an adaptive action sampling approach to reduce joint sampling error. Our method, Multi-Agent Proximal Robust On-Policy Sampling (MA-PROPS), uses a centralized behavior policy that we continually adapt to place larger probability on joint actions that are currently under-sampled w.r.t. the current joint policy. We empirically evaluate MA-PROPS in a diverse range of multi-agent games and demonstrate that (1) MA-PROPS reduces joint sampling error more efficiently than standard on-policy sampling and (2) improves the reliability of independent policy gradient algorithms, increasing the fraction of training runs that converge to an optimal joint policy.
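To make the core idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of the adaptation step described above: given a target joint policy over a small discrete joint-action space and empirical counts from finite sampling, a behavior distribution is tilted toward joint actions whose empirical frequency falls short of their target probability. The function name `adapt_behavior_policy`, the additive-deficit tilt, and the `temperature` parameter are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def adapt_behavior_policy(target_probs, counts, temperature=1.0):
    """Illustrative sketch: build a behavior distribution that upweights
    joint actions under-sampled relative to the target joint policy.

    target_probs : array of target joint-action probabilities (sums to 1)
    counts       : array of empirical joint-action counts so far
    temperature  : hypothetical knob controlling how aggressively the
                   behavior policy corrects sampling deficits
    """
    total = counts.sum()
    empirical = counts / total if total > 0 else np.zeros_like(target_probs)
    # Positive deficit => this joint action is under-sampled w.r.t. the target.
    deficit = target_probs - empirical
    # Tilt the target's log-probabilities by the deficit, then renormalize.
    logits = np.log(target_probs + 1e-12) + temperature * deficit
    behavior = np.exp(logits - logits.max())
    return behavior / behavior.sum()

# Usage: uniform target over 4 joint actions, but action 3 never sampled yet.
target = np.array([0.25, 0.25, 0.25, 0.25])
counts = np.array([10.0, 10.0, 10.0, 0.0])
behavior = adapt_behavior_policy(target, counts)
# The behavior policy now places more than 0.25 mass on the
# under-sampled joint action, pulling the empirical joint
# distribution back toward the target.
```

In the actual method, this correction is applied continually during training and combined with a proximal constraint (per the "Proximal" in MA-PROPS) so the behavior policy never drifts far from the current joint policy; this sketch shows only the reweighting intuition.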