Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

📅 2024-04-16
🏛️ Neural Information Processing Systems
📈 Citations: 4
Influential: 1
🤖 AI Summary
This work addresses provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). It proposes the first unified randomized exploration framework for parallel Markov decision processes (MDPs), introducing two Thompson sampling-type algorithms, CoopTS-PHE and CoopTS-LMC, which integrate perturbed-history exploration (PHE) and Langevin Monte Carlo (LMC) exploration, respectively, enabling distributed decision-making with low communication overhead. The main contribution is the first provably efficient randomized exploration theory for cooperative MARL, which also uncovers intrinsic connections to federated learning. Empirically, the algorithms significantly outperform baselines on $N$-chain, video game, and energy system benchmarks. Theoretically, they achieve $\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$ regret and $\widetilde{\mathcal{O}}(dHM^2)$ communication complexity, while remaining robust to model misspecification.

📝 Abstract
We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC, incorporating the perturbed-history exploration (PHE) strategy and the Langevin Monte Carlo exploration (LMC) strategy respectively, which are flexible in design and easy to implement in practice. For a special class of parallel MDPs where the transition is (approximately) linear, we theoretically prove that both CoopTS-PHE and CoopTS-LMC achieve a $\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$ regret bound with communication complexity $\widetilde{\mathcal{O}}(dHM^2)$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, and $K$ is the number of episodes. This is the first theoretical result for randomized exploration in cooperative MARL. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., $N$-chain), a video game, and a real-world problem in energy systems. Our experimental results support that our framework can achieve better performance, even under conditions of misspecified transition models. Additionally, we establish a connection between our unified framework and the practical application of federated learning.
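To make the PHE strategy concrete: the idea is to inject randomness through the data rather than the posterior, by perturbing past rewards with pseudo-noise, refitting a regularized least-squares estimate, and acting greedily under the perturbed fit. The following is a minimal single-agent sketch under a linear-reward assumption; it is not the paper's CoopTS-PHE implementation, and the function name and parameters (`noise_scale`, `reg`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def phe_action(features, rewards, candidate_feats, noise_scale=1.0, reg=1.0):
    """Perturbed-history exploration (sketch): perturb each observed reward
    with Gaussian pseudo-noise, refit a ridge-regression estimate of the
    reward parameter, and act greedily under the perturbed estimate."""
    d = features.shape[1]
    # Perturb the history: each past reward gets independent Gaussian noise.
    perturbed = rewards + noise_scale * rng.standard_normal(len(rewards))
    # Regularized least squares on the perturbed history.
    gram = features.T @ features + reg * np.eye(d)
    theta = np.linalg.solve(gram, features.T @ perturbed)
    # Greedy action with respect to the perturbed fit.
    return int(np.argmax(candidate_feats @ theta))
```

Because a fresh perturbation is drawn each round, repeated calls explore in a Thompson-sampling-like manner while each individual call remains a simple least-squares solve.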
Problem

Research questions and friction points this paper is trying to address.

Explores efficient randomized exploration in cooperative multi-agent reinforcement learning.
Proposes algorithms for parallel MDPs with theoretical regret bounds.
Connects MARL framework to practical applications like federated learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified algorithm framework for parallel MDPs
Thompson Sampling with PHE and LMC strategies
Provable regret and communication-complexity bounds