🤖 AI Summary
In decentralized multi-agent deep reinforcement learning (MADRL), communication facilitates coordination but introduces significant uncertainty, substantially increasing policy gradient variance and undermining training stability. This work presents the first theoretical modeling and quantitative analysis of communication-induced policy gradient variance. We propose a modular variance suppression framework that employs control variates to design a communication-aware gradient correction module—lightweight, plug-and-play, and compatible with mainstream algorithms including MAPPO and QMix. Evaluated on StarCraft II and Traffic Junction benchmarks, our approach consistently reduces policy gradient variance, improves convergence stability, and enhances final task performance. Empirical results validate both the effectiveness and generalizability of our variance modeling and suppression methodology across diverse cooperative multi-agent settings.
📝 Abstract
In decentralized multi-agent deep reinforcement learning (MADRL), communication can help agents to gain a better understanding of the environment to better coordinate their behaviors. Nevertheless, communication may involve uncertainty, which potentially introduces variance to the learning of decentralized agents. In this paper, we focus on a specific decentralized MADRL setting with communication and conduct a theoretical analysis to study the variance that is caused by communication in policy gradients. We propose modular techniques to reduce the variance in policy gradients during training. We adopt our modular techniques into two existing algorithms for decentralized MADRL with communication and evaluate them on multiple tasks in the StarCraft Multi-Agent Challenge and Traffic Junction domains. The results show that decentralized MADRL communication methods extended with our proposed techniques not only achieve high-performing agents but also reduce variance in policy gradients during training.