🤖 AI Summary
This paper addresses the challenge of learning temporally coordinated multi-agent policies in multi-task settings under the centralized training with decentralized execution (CTDE) paradigm. To overcome the low sample efficiency of existing methods and their restriction to single-task settings, the authors propose ACC-MARL, a framework that models temporal tasks as finite-state automata, enabling explicit task decomposition and coordination among agents. ACC-MARL learns task-conditioned, decentralized team policies and uses the value functions of the learned policies to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door and holding the door open, along with improved sample efficiency and cross-task generalization.
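To make the core idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation): a temporal task represented as a finite-state automaton, and a toy value-function-based assignment that picks the sub-task permutation maximizing the agents' summed value estimates. All names (`Dfa`, `assign_tasks`, the example events and values) are illustrative assumptions.

```python
# Illustrative sketch only; ACC-MARL's actual task representation and
# assignment procedure are defined in the paper, not here.
from itertools import permutations

class Dfa:
    """A temporal task as a DFA: start state, accepting states, transitions."""
    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = accepting
        self.transitions = transitions  # maps (state, event) -> next state

    def run(self, events):
        state = self.start
        for e in events:
            # Undefined (state, event) pairs self-loop: the event is ignored.
            state = self.transitions.get((state, e), state)
        return state in self.accepting

# "Press the button, then open the door" as a two-step temporal task.
task = Dfa(start=0, accepting={2},
           transitions={(0, "press_button"): 1, (1, "open_door"): 2})

def assign_tasks(values, agents, subtasks):
    """Assign sub-tasks to agents by maximizing summed value estimates.

    `values` maps (agent, subtask) pairs to scalar value estimates, standing
    in for the learned policies' value functions.
    """
    best = max(permutations(subtasks),
               key=lambda p: sum(values[(a, s)] for a, s in zip(agents, p)))
    return dict(zip(agents, best))
```

For example, `task.run(["press_button", "open_door"])` accepts, while the reversed event order does not, capturing the temporal ordering constraint; `assign_tasks` then chooses which agent pursues which sub-automaton.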
📝 Abstract
We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives under centralized training with decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of the learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.