🤖 AI Summary
Current model-free multi-agent reinforcement learning (MARL) under the Dec-POMDP framework over-relies on fragile, co-adapted conventions rather than robust policies grounded in observations and memory, resulting in poor generalization. Existing benchmarks inadequately test core Dec-POMDP assumptions, particularly observability and memory requirements. Method: The authors critique prevailing environment design, propose revised principles for constructing collaborative tasks (emphasizing observation-grounded behaviour and explicit memory-dependent reasoning about other agents), and support the argument with a targeted case study and systematic ablations. Contribution/Results: The case study shows that standard environments let co-adapting agents succeed via brittle conventions that fail when paired with non-adaptive partners, whereas the same models learn robust, generalizable policies when the task explicitly requires joint observation-memory reasoning. This work identifies benchmark design, not algorithmic architecture, as the primary bottleneck, and offers practical guidelines for principled, partial-observability-aware evaluation environments.
📝 Abstract
Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents' behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.
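For context, the Dec-POMDP formalism the abstract refers to is standardly given as the following tuple (this is the textbook definition, not notation taken from this particular paper):

```latex
\[
\mathcal{M} = \langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle
\]
```

where $\mathcal{I}=\{1,\dots,n\}$ is the set of agents, $\mathcal{S}$ the state space, $\mathcal{A}_i$ agent $i$'s action space, $T(s' \mid s, \mathbf{a})$ the transition function over joint actions $\mathbf{a}=(a_1,\dots,a_n)$, $R(s,\mathbf{a})$ the shared reward, $\Omega_i$ agent $i$'s observation space, $O(\mathbf{o} \mid s', \mathbf{a})$ the joint observation function, and $\gamma \in [0,1)$ the discount factor. Because each agent sees only its own observations, its policy $\pi_i(a_i \mid h_i)$ must condition on the local action-observation history $h_i = (o_i^1, a_i^1, \dots, o_i^t)$, which is why recurrent function approximators are the default choice in practice.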