🤖 AI Summary
Current model-free multi-agent reinforcement learning (MARL) under the Dec-POMDP framework over-relies on fragile, co-adapted conventions rather than robust policies grounded in observations and memory, resulting in poor generalization. Existing benchmarks inadequately test core Dec-POMDP assumptions, particularly observability and memory requirements. Method: The authors critique prevailing environment design, propose revised principles for constructing collaborative tasks (emphasizing observation-grounded behaviour and explicit memory-dependent reasoning about other agents), and support the argument with a targeted case study and systematic ablations. Contribution/Results: The case study shows that standard environments let co-adapting agents succeed via brittle conventions that fail when paired with non-adaptive partners, whereas the same models learn robust, generalizable policies when the task explicitly requires joint observation-memory reasoning. This work identifies benchmark design, not algorithmic architecture, as the primary bottleneck, and offers practical guidelines for principled, partial-observability-aware evaluation environments.
📝 Abstract
Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents' behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.
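For context, the Dec-POMDP formalism the abstract refers to is standardly given as the following tuple (this is the textbook definition, not notation taken from this particular paper):

```latex
\[
\mathcal{M} = \langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle
\]
```

where $\mathcal{I}=\{1,\dots,n\}$ is the set of agents, $\mathcal{S}$ the state space, $\mathcal{A}_i$ agent $i$'s action space, $T(s' \mid s, \mathbf{a})$ the transition function over joint actions $\mathbf{a}=(a_1,\dots,a_n)$, $R(s,\mathbf{a})$ the shared reward, $\Omega_i$ agent $i$'s observation space, $O(\mathbf{o} \mid s', \mathbf{a})$ the joint observation function, and $\gamma \in [0,1)$ the discount factor. Because each agent sees only its own observations, its policy $\pi_i(a_i \mid h_i)$ must condition on the local action-observation history $h_i = (o_i^1, a_i^1, \dots, o_i^t)$, which is why recurrent function approximators are the default choice in practice.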