🤖 AI Summary
This work addresses the challenge of entropy-driven exploration and coordination in online multi-agent reinforcement learning (MARL) with diffusion policies, where intractable likelihoods hinder conventional entropy maximization. To overcome this limitation, the paper proposes OMAD, a framework that operates within the centralized training with decentralized execution (CTDE) paradigm: it introduces a scaled joint-entropy maximization objective that requires no explicit likelihood computation and couples it with a joint distributional value function to train decentralized diffusion policies. Among the first approaches to apply diffusion policies in online MARL, OMAD achieves state-of-the-art performance across ten benchmark tasks from MPE and MAMuJoCo, demonstrating 2.5–5× improvements in sample efficiency over prior methods.
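For intuition, an entropy-augmented objective of the kind the summary describes typically takes the following maximum-entropy form (a generic SAC-style sketch; OMAD's exact scaled-entropy surrogate and notation may differ):

$$
J(\pi_{\mathrm{jt}}) = \mathbb{E}\!\left[\sum_{t} \gamma^{t}\Big(r(s_t,\mathbf{a}_t) + \alpha\,\mathcal{H}\big(\pi_{\mathrm{jt}}(\cdot\mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi_{\mathrm{jt}}(\cdot\mid s)\big) = -\,\mathbb{E}_{\mathbf{a}\sim\pi_{\mathrm{jt}}}\big[\log \pi_{\mathrm{jt}}(\mathbf{a}\mid s)\big],
$$

where $\pi_{\mathrm{jt}}$ is the joint policy induced by the per-agent diffusion policies and $\alpha$ is an entropy temperature. The log-likelihood inside $\mathcal{H}$ is exactly the quantity that is intractable for diffusion policies, which is why OMAD replaces it with a likelihood-free scaled surrogate.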
📝 Abstract
Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination, and enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose **OMAD**, among the first **O**nline off-policy **MA**RL frameworks using **D**iffusion policies to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes a scaled joint entropy, facilitating effective exploration without relying on tractable likelihoods. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. This value function leverages tractable entropy-augmented targets to guide the simultaneous updates of the diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.
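To make the training mechanics concrete, below is a minimal PyTorch-style sketch of an entropy-augmented CTDE critic target of the sort the abstract describes. Everything in it is an illustrative assumption rather than the paper's implementation: the MLP stand-in for a diffusion actor, the particle-variance entropy proxy, the scalar (non-distributional) critic, and all names such as `DiffusionPolicy`, `CentralizedCritic`, and `entropy_surrogate`.

```python
# Illustrative CTDE update with an entropy-augmented critic target.
# All names and the entropy surrogate are assumptions for exposition;
# they are NOT OMAD's actual implementation.
import torch
import torch.nn as nn


class DiffusionPolicy(nn.Module):
    """Stand-in per-agent actor: maps a local observation to an action.

    A real diffusion policy would iteratively denoise an action sample;
    a small MLP serves here as a placeholder with the same interface.
    """

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())

    def sample(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Joint Q-function over the global state and all agents' actions."""

    def __init__(self, state_dim: int, joint_act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + joint_act_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state: torch.Tensor, joint_act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_act], dim=-1))


def entropy_surrogate(joint_act: torch.Tensor) -> torch.Tensor:
    """Placeholder for a likelihood-free joint-entropy estimate.

    OMAD's scaled joint entropy avoids explicit log-likelihoods; the
    particle-variance proxy below is only a stand-in for exposition.
    """
    return joint_act.var(dim=0).sum().expand(joint_act.shape[0])


def critic_target(critic_tgt, policies, batch, alpha=0.2, gamma=0.99):
    """Entropy-augmented one-step target: r + gamma * (Q' + alpha * H)."""
    with torch.no_grad():
        # Each decentralized actor samples from its own local observation.
        next_acts = torch.cat(
            [pi.sample(obs) for pi, obs in zip(policies, batch["next_obs"])],
            dim=-1)
        q_next = critic_tgt(batch["next_state"], next_acts).squeeze(-1)
        ent = entropy_surrogate(next_acts)
        return batch["reward"] + gamma * (1 - batch["done"]) * (q_next + alpha * ent)
```

OMAD itself uses a joint distributional value function and a scaled joint-entropy objective; the sketch keeps a scalar Q and a variance-based proxy only to show where a tractable entropy-augmented target enters the simultaneous update of the decentralized policies.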