An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing MARL evaluation benchmarks (e.g., SMAC, GRF) focus predominantly on team-based games, offer limited task diversity, and operate almost exclusively in low-dimensional state spaces, leaving high-dimensional visual observations and fully cooperative real-world settings (e.g., multi-robot coordination, warehouse resource management, search-and-rescue, human-AI collaboration) largely unevaluated. Method: The paper conducts an extensive evaluation of well-known MARL algorithms on complex, fully cooperative benchmarks, including tasks with images as agents' observations, where visual encoders (e.g., CNNs) are trained jointly with the policy networks. It also introduces PyMARLzoo+, an open-source extension of the widely used (E)PyMARL libraries that provides seamless integration with all PettingZoo benchmarks as well as Overcooked, PressurePlate, Capture Target, and Box Pushing. Contribution/Results: Experiments show that several algorithms regarded as state-of-the-art on SMAC and GRF can underperform standard MARL baselines on these fully cooperative benchmarks, underscoring the need for broader, more systematic evaluation of cooperative MARL.
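The integration target described above is PettingZoo's parallel API, in which every call returns per-agent dictionaries and, in fully cooperative tasks, all agents share one team reward. The sketch below illustrates only that interface shape with a hypothetical toy environment (`ToyCoopEnv` is not part of PettingZoo or PyMARLzoo+); a real wrapper would expose an actual PettingZoo environment the same way.

```python
# Minimal sketch of the PettingZoo-style "parallel API" that frameworks
# like PyMARLzoo+ wrap. ToyCoopEnv is a hypothetical stand-in: only the
# dict-per-agent signatures and the shared (fully cooperative) reward
# follow the real PettingZoo convention.
import random


class ToyCoopEnv:
    """Two agents; both receive the same team reward each step."""

    def __init__(self, episode_len=5):
        self.agents = ["agent_0", "agent_1"]
        self.episode_len = episode_len
        self._t = 0

    def reset(self, seed=None):
        random.seed(seed)
        self._t = 0
        obs = {a: [0.0, 0.0] for a in self.agents}
        infos = {a: {} for a in self.agents}
        return obs, infos

    def step(self, actions):
        self._t += 1
        # Fully cooperative: a single scalar team reward, copied per agent.
        team_reward = float(actions["agent_0"] == actions["agent_1"])
        obs = {a: [float(self._t), float(actions[a])] for a in self.agents}
        rewards = {a: team_reward for a in self.agents}
        terminations = {a: False for a in self.agents}
        truncations = {a: self._t >= self.episode_len for a in self.agents}
        infos = {a: {} for a in self.agents}
        return obs, rewards, terminations, truncations, infos


def rollout(env):
    """Run one episode with random joint actions; return the team return."""
    obs, infos = env.reset(seed=0)
    total, done = 0.0, False
    while not done:
        actions = {a: random.randint(0, 1) for a in env.agents}
        obs, rewards, terms, truncs, infos = env.step(actions)
        total += rewards["agent_0"]  # same value for every agent
        done = any(terms.values()) or any(truncs.values())
    return total


episode_return = rollout(ToyCoopEnv())
```

Because rewards, terminations, and truncations are all keyed by agent name, a wrapper can convert any environment with this interface into the joint-observation, joint-action form that (E)PyMARL-style trainers expect.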

📝 Abstract
Multi-Agent Reinforcement Learning (MARL) has recently emerged as a significant area of research. However, MARL evaluation often lacks systematic diversity, hindering a comprehensive understanding of algorithms' capabilities. In particular, cooperative MARL algorithms are predominantly evaluated on benchmarks such as SMAC and GRF, which primarily feature team game scenarios without adequately assessing the various aspects of agents' capabilities required in fully cooperative real-world tasks such as multi-robot cooperation, warehouse resource management, search and rescue, and human-AI cooperation. Moreover, MARL algorithms are mainly evaluated on low-dimensional state spaces, and thus their performance on high-dimensional (e.g., image) observations is not well-studied. To fill this gap, this paper highlights the crucial need for expanding systematic evaluation across a wider array of existing benchmarks. To this end, we conduct extensive evaluation and comparisons of well-known MARL algorithms on complex fully cooperative benchmarks, including tasks with images as agents' observations. Interestingly, our analysis shows that many algorithms, hailed as state-of-the-art on SMAC and GRF, may underperform standard MARL baselines on fully cooperative benchmarks. Finally, towards more systematic and better evaluation of cooperative MARL algorithms, we have open-sourced PyMARLzoo+, an extension of the widely used (E)PyMARL libraries, which addresses an open challenge from [TBG++21], facilitating seamless integration and support with all benchmarks of PettingZoo, as well as Overcooked, PressurePlate, Capture Target and Box Pushing.
Problem

Research questions and friction points this paper is trying to address.

Evaluates MARL in complex cooperative tasks
Addresses lack of diversity in MARL benchmarks
Assesses MARL performance on high-dimensional observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended MARL benchmarking
High-dimensional image observations
Open-sourced PyMARLzoo+ library