🤖 AI Summary
This work addresses the instability of existing cooperative multi-agent reinforcement learning (MARL) methods under environmental uncertainties, such as sim-to-real gaps, model mismatch, and system noise, by introducing distributionally robust optimization into value-decomposition MARL for the first time. The authors propose the Distributionally Robust Individual-Global Maximization (DrIGM) principle, which requires each agent's robust greedy action to agree with the robust team-optimal joint action, so that robust individual action-value estimates yield system-wide robustness under decentralized execution. The framework yields plug-and-play robust variants of mainstream architectures such as VDN, QMIX, and QTRAN without per-agent reward shaping. Theoretical analysis provides a formal robustness guarantee, and experiments show that the approach consistently improves out-of-distribution performance on both the high-fidelity SustainGym simulators and the StarCraft Multi-Agent Challenge, while preserving scalability and ease of implementation.
📝 Abstract
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings because of environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally Robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN, QMIX, and QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at https://github.com/crqu/robust-coMARL.
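To make the standard (non-robust) IGM principle concrete, the following minimal sketch checks it for a VDN-style additive factorization, where the joint value is the sum of per-agent values. It is an illustration only and is not taken from the paper's code: the function names, the three-agent setup, and the random Q-values are all assumptions, and it does not implement the robust action values that DrIGM relies on.

```python
import itertools

import numpy as np


def vdn_joint_q(per_agent_q, joint_action):
    """Additive (VDN-style) factorization: Q_tot(a) = sum_i Q_i(a_i)."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def decentralized_greedy(per_agent_q):
    """Each agent independently takes the argmax of its own Q-values."""
    return tuple(int(np.argmax(q)) for q in per_agent_q)


def centralized_greedy(per_agent_q):
    """Brute-force argmax of Q_tot over every joint action."""
    action_ranges = [range(len(q)) for q in per_agent_q]
    return max(
        itertools.product(*action_ranges),
        key=lambda joint: vdn_joint_q(per_agent_q, joint),
    )


# Hypothetical example: 3 agents with 4 actions each, for a single fixed state.
rng = np.random.default_rng(0)
per_agent_q = [rng.normal(size=4) for _ in range(3)]

local_choice = decentralized_greedy(per_agent_q)
team_choice = centralized_greedy(per_agent_q)

# IGM: the decentralized greedy actions recover the team-optimal joint action.
print(local_choice == team_choice)
```

The brute-force search enumerates every joint action, so its cost grows exponentially with the number of agents, while the decentralized choice needs one argmax per agent. That gap is the practical reason factorization methods enforce IGM, and DrIGM asks for the analogous agreement between robust greedy actions and the robust team-optimal joint action.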