Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited expressiveness of Gaussian policies in decentralized softmax policy gradient (DecSPG) methods. Gaussian policies cannot model multimodal action distributions, which hinders exploration in cooperative multi-agent reinforcement learning, and the limitation worsens as the number of agents increases. The authors propose Decentralized Diffusion Policy Learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model so that it can represent complex, multimodal action distributions. They also introduce importance sampling score matching (ISSM), a training method with a theoretical guarantee that enables efficient online training of diffusion policies. Experiments on continuous-action MARL benchmarks, including the Multi-Agent Particle Environment, Multi-Agent MuJoCo, IsaacLab, and a JAX reimplementation of the StarCraft Multi-Agent Challenge, show consistently improved performance.
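To make the expressiveness issue concrete, the following is a minimal single-agent sketch, not the paper's setup. It assumes a hypothetical one-dimensional action value `q_value` with two equally good actions and builds the corresponding softmax (energy-based) policy on a grid. It then fits a Gaussian by moment matching. The projection direction used in the paper is not stated here, so moment matching is an assumption chosen for simplicity; all function names and the temperature are illustrative.

```python
import numpy as np

def q_value(a):
    # Hypothetical action value with two equally good actions at a = -1 and a = +1.
    return -(a**2 - 1.0) ** 2

def energy_based_policy(actions, temperature=0.1):
    # Softmax policy pi(a) proportional to exp(Q(a) / temperature),
    # normalized numerically on a uniform grid.
    logits = q_value(actions) / temperature
    weights = np.exp(logits - logits.max())
    da = actions[1] - actions[0]
    return weights / (weights.sum() * da)

def gaussian_projection(actions, density):
    # Moment matching: the Gaussian that minimizes KL(pi || N(mu, sigma^2)).
    da = actions[1] - actions[0]
    mean = np.sum(actions * density) * da
    var = np.sum((actions - mean) ** 2 * density) * da
    return mean, np.sqrt(var)

actions = np.linspace(-3.0, 3.0, 6001)
pi = energy_based_policy(actions)
mu, sigma = gaussian_projection(actions, pi)

# Density the fitted Gaussian assigns to its own mean, versus the density
# the energy-based policy assigns to that same action.
gauss_at_mean = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
pi_at_mean = np.interp(mu, actions, pi)

print(f"Gaussian fit: mu={mu:.3f}, sigma={sigma:.3f}")
print(f"density at mu: Gaussian={gauss_at_mean:.3f}, energy-based={pi_at_mean:.2e}")
```

In this toy case the fitted Gaussian centers on the action between the two modes, which the energy-based policy almost never selects. A mode-seeking projection would instead collapse onto one mode and ignore the other; either way, a single Gaussian cannot cover both.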

📝 Abstract

Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and that this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with a theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including the multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and a JAX reimplementation of the StarCraft multi-agent challenge, and observe consistently improved performance.
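As a rough illustration of how a denoising diffusion probabilistic model can produce multi-modal actions, the sketch below runs standard DDPM ancestral sampling for a scalar action. It is not the DDPL algorithm and does not implement ISSM. In place of a trained network it uses `toy_noise_predictor`, a hypothetical closed-form noise predictor that is exact when the clean action is -1 or +1 with equal probability. The schedule values and all names are assumptions made for this example.

```python
import numpy as np

def make_schedule(num_steps=100, beta_min=1e-4, beta_max=0.2):
    # Linear noise schedule (assumed values, chosen so the final step is near pure noise).
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, alpha_bar_t):
    # Stand-in for a learned network: the exact E[eps | x_t] when the clean
    # action is -1 or +1 with equal probability.
    signal = np.sqrt(alpha_bar_t)
    noise_var = 1.0 - alpha_bar_t
    posterior_mean_action = np.tanh(x_t * signal / noise_var)
    return (x_t - signal * posterior_mean_action) / np.sqrt(noise_var)

def sample_actions(num_samples, rng, noise_predictor=toy_noise_predictor):
    # DDPM ancestral sampling: start from Gaussian noise and denoise step by step.
    betas, alphas, alpha_bars = make_schedule()
    x = rng.standard_normal(num_samples)
    for t in reversed(range(len(betas))):
        eps_hat = noise_predictor(x, alpha_bars[t])
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(num_samples)
    return x

rng = np.random.default_rng(0)
samples = sample_actions(2000, rng)
fraction_positive = np.mean(samples > 0)
print(f"fraction of sampled actions near +1: {fraction_positive:.2f}")
```

With this toy predictor the sampled actions land on both modes rather than between them. In an actual diffusion policy the noise predictor would be a network conditioned on each agent's local observation, and how that network is trained online is what ISSM addresses.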

Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning

exploration

decentralized policy

Gaussian policy

expressiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy

multi-agent reinforcement learning

exploration