Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

📅 2026-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of entropy-driven exploration and coordination in online multi-agent reinforcement learning (MARL) with diffusion policies, where intractable likelihoods hinder conventional entropy maximization. To overcome this limitation, the paper proposes OMAD, a framework that, within the centralized training with decentralized execution (CTDE) paradigm, introduces a scaled joint-entropy maximization objective requiring no explicit likelihood computation and couples it with a joint distributional value function to train decentralized diffusion policies. Among the first approaches to successfully apply diffusion policies in online MARL, OMAD achieves state-of-the-art performance across ten benchmark tasks from MPE and MAMuJoCo, demonstrating 2.5–5× improvements in sample efficiency over prior methods.
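To make the setup concrete, here is a minimal, self-contained sketch of what per-agent diffusion-policy action sampling could look like at decentralized execution time, using a DDPM-style denoiser conditioned on the agent's local observation. This illustrates the general technique rather than the paper's implementation; the network layout, noise schedule, and all names (`DiffusionPolicy`, `n_steps`, etc.) are hypothetical.

```python
# Hypothetical sketch of a decentralized diffusion policy (DDPM-style);
# the paper's actual architecture and schedule may differ.
import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    """Per-agent policy: denoises a Gaussian sample into an action,
    conditioned on the agent's local observation."""
    def __init__(self, obs_dim, act_dim, hidden=256, n_steps=5):
        super().__init__()
        self.n_steps = n_steps
        # Linear noise schedule and derived cumulative alpha products.
        betas = torch.linspace(1e-4, 0.1, n_steps)
        self.register_buffer("betas", betas)
        self.register_buffer("alpha_bars", torch.cumprod(1.0 - betas, dim=0))
        # Denoiser predicts the noise added at step k, given (obs, noisy action, k).
        self.denoiser = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    @torch.no_grad()
    def sample(self, obs):
        """Reverse diffusion: start from pure noise, iteratively denoise into an action."""
        act_dim = self.denoiser[-1].out_features
        a = torch.randn(obs.shape[0], act_dim, device=obs.device)
        for k in reversed(range(self.n_steps)):
            t = torch.full((obs.shape[0], 1), k / self.n_steps, device=obs.device)
            eps = self.denoiser(torch.cat([obs, a, t], dim=-1))
            alpha = 1.0 - self.betas[k]
            a = (a - self.betas[k] / torch.sqrt(1.0 - self.alpha_bars[k]) * eps) / torch.sqrt(alpha)
            if k > 0:  # inject noise at every step except the last
                a = a + torch.sqrt(self.betas[k]) * torch.randn_like(a)
        return torch.tanh(a)  # squash into a bounded action space
```

Each agent runs this sampler on its own observation at execution time, which is what makes the execution decentralized even though training is centralized.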

📝 Abstract
Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose OMAD, among the first Online off-policy MARL frameworks using Diffusion policies, to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihoods. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of the diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state of the art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.
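The abstract's central idea, entropy-augmented critic targets without tractable diffusion likelihoods, can be sketched as follows. Since the page does not spell out the paper's estimator, this sketch substitutes a simple particle-based (nearest-neighbor) entropy proxy over sampled actions and a scalar centralized critic in place of the paper's distributional value function; every function and parameter name here is an assumption.

```python
# Hedged sketch of a CTDE critic target with an entropy bonus. The paper's
# "scaled joint entropy" and distributional critic are approximated here by a
# nearest-neighbor entropy proxy and a scalar Q-function, respectively.
import torch

def entropy_particle_estimate(actions):
    """Kozachenko-Leonenko-style proxy: mean log distance to the nearest
    neighbor among M sampled actions per state. actions: (B, M, act_dim)."""
    d = torch.cdist(actions, actions)                       # (B, M, M) pairwise distances
    d = d + torch.eye(d.shape[-1], device=d.device) * 1e9   # mask self-distances
    nn_dist = d.min(dim=-1).values.clamp_min(1e-8)          # (B, M)
    return nn_dist.log().mean(dim=-1)                       # (B,)

def critic_target(reward, done, next_obs, policies, critic_tgt,
                  alpha=0.2, gamma=0.99, M=8):
    """One-step target: r + gamma * (Q(s', a') + alpha * sum_i H_i), with
    a' drawn from the agents' diffusion policies.
    next_obs: (B, n_agents, obs_dim); critic_tgt: hypothetical target critic
    mapping (flattened state, joint action) -> (B, 1)."""
    with torch.no_grad():
        next_actions, entropies = [], []
        for i, pi in enumerate(policies):
            obs_i = next_obs[:, i]                           # agent i's local observation
            # Draw M action samples to estimate this agent's policy entropy.
            samples = torch.stack([pi.sample(obs_i) for _ in range(M)], dim=1)
            next_actions.append(samples[:, 0])               # one joint action for the Q term
            entropies.append(entropy_particle_estimate(samples))
        joint_next = torch.cat(next_actions, dim=-1)
        q_next = critic_tgt(next_obs.flatten(1), joint_next).squeeze(-1)
        ent = torch.stack(entropies, dim=0).sum(dim=0)       # joint-entropy proxy
        return reward + gamma * (1.0 - done) * (q_next + alpha * ent)
```

The `alpha` coefficient plays the role of the entropy scale: it trades off exploration (higher joint entropy) against exploitation of the centralized value estimate, which is the knob the abstract's "scaled joint entropy" objective exposes.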
Problem

Research questions and friction points this paper is trying to address.

Online Multi-Agent Reinforcement Learning
Diffusion Policies
Entropy-based Exploration
Coordination
Intractable Likelihood
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Policies
Online Multi-Agent Reinforcement Learning
Entropy Maximization
CTDE
Sample Efficiency
Zhuoran Li
Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Hai Zhong
Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Xun Wang
Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Qingxin Xia
Osaka University
CHI · ubiquitous computing · mobile sensing
Lihua Zhang
Wuhan University
computational biology · bioinformatics · data mining
Longbo Huang
Professor, IIIS, Tsinghua University, ACM Distinguished Scientist
Reinforcement Learning (RL) · Deep RL · Machine Learning · Stochastic Networks · Performance Evaluation