Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

📅 2025-11-12

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

In multi-agent cooperation, adversarial optimization often induces self-destructive behaviors—agents sacrifice task performance to suppress opponents, leading to irrational, less robust, and less diverse policies. To address this, we propose the Rationality-Preserving Optimization (RPO) framework and the Rational Policy Gradient (RPG) algorithm, which jointly integrate opponent modeling and explicit rationality constraints to suppress self-destructive policy updates in non-zero-sum games. RPG augments the policy gradient with a rationality regularization term and employs opponent shaping to steer learning toward cooperative equilibria. Extensive experiments across cooperative (e.g., Overcooked), mixed-motive (e.g., Predator-Prey), and general-sum environments demonstrate that RPO significantly improves policy robustness (+23.6% task success under environmental perturbations), adaptability (+19.4% cross-opponent generalization), and diversity (+31.2% policy entropy) over baselines. This work provides the first systematic solution to rationality collapse in adversarial training for cooperative multi-agent settings.

Technology Category

Application Category

📝 Abstract

Adversarial optimization algorithms that explicitly search for flaws in agents'policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational--that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments. Our project page can be found at https://rational-policy-gradient.github.io.

Problem

Research questions and friction points this paper is trying to address.

Preventing self-sabotage in cooperative multi-agent adversarial optimization

Extending adversarial methods beyond zero-sum to cooperative settings

Ensuring agent rationality while improving robustness and policy diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

RPG uses opponent shaping to optimize adversarial objectives

RPO formalism prevents self-sabotage by preserving agent rationality

Method extends adversarial optimization to cooperative multi-agent settings

🔎 Similar Papers

No similar papers found.