Optimistic {epsilon}-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

📅 2025-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Monotonic value decomposition in centralized training with decentralized execution (CTDE) multi-agent reinforcement learning often underestimates optimal action-values, leading to suboptimal policies. Method: We propose an optimistic ε-greedy exploration mechanism that integrates an optimistic update network into the ε-greedy framework—marking the first such incorporation—to dynamically identify potentially optimal actions during decentralized execution and adaptively increase their sampling probability, thereby correcting value estimation bias at the exploration level. Our approach combines monotonic value decomposition, an optimistic action identification network, and a probabilistic ε-resampling strategy within the CTDE paradigm. Results: Evaluated on multiple standard multi-agent benchmarks, our method significantly outperforms baselines including QMIX and QPLEX, achieving an average task completion rate improvement of 12.7% and effectively avoiding convergence to suboptimal policies.

Technology Category

Application Category

📝 Abstract
The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, due to the representational limitations of traditional monotonic value decomposition methods, algorithms can underestimate optimal actions, leading policies to suboptimal solutions. To address this challenge, we propose Optimistic $epsilon$-Greedy Exploration, focusing on enhancing exploration to correct value estimations. The underestimation arises from insufficient sampling of optimal actions during exploration, as our analysis indicated. We introduce an optimistic updating network to identify optimal actions and sample actions from its distribution with a probability of $epsilon$ during exploration, increasing the selection frequency of optimal actions. Experimental results in various environments reveal that the Optimistic $epsilon$-Greedy Exploration effectively prevents the algorithm from suboptimal solutions and significantly improves its performance compared to other algorithms.
Problem

Research questions and friction points this paper is trying to address.

Address underestimation in multi-agent reinforcement learning
Enhance exploration to correct value estimations
Increase optimal action selection frequency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic ε-Greedy Exploration
Centralized Training Decentralized Execution
Optimistic Updating Network
🔎 Similar Papers
No similar papers found.
Ruoning Zhang
Ruoning Zhang
UESTC
Reinforcement Learning
Siying Wang
Siying Wang
University of Electronic Science and Technology of China
reinforcement learninmulti-agent reinforcement learningoffline-to-online reinforcement learning
Wenyu Chen
Wenyu Chen
Massachusetts Institute of Technology
optimizationstatistical learning
Y
Yang Zhou
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Z
Zhitong Zhao
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Zixuan Zhang
Zixuan Zhang
Georgia Institute of Technology
Machine Learning
R
Ruijie Zhang
School of Computer Science and Engineering, University of Electronic Science and Technology of China