🤖 AI Summary
Existing multi-agent reinforcement learning (MARL) methods such as MAT and ACE rely on sequential decision-making but do not explicitly account for the order in which agents decide on their actions, a factor that matters for cooperation. To address this, we propose AOAD-MAT, a MAT-based framework that learns the action decision order as part of training. It adds an auxiliary task, predicting the next agent to act, that is optimized jointly with the PPO objective so that the model captures dependencies between agents' actions. Built on a Transformer-based actor-critic architecture, AOAD-MAT outperforms MAT and other baselines on the SMAC and MAMuJoCo benchmarks. The core contribution is treating the agents' decision order as a learnable quantity within MARL and showing that modeling it explicitly improves performance in sequential multi-agent decision-making.
📝 Abstract
Multi-agent reinforcement learning (MARL) focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE) have significantly improved performance by leveraging sequential decision-making processes. However, these models do not explicitly consider the order in which agents make decisions. In this paper, we propose Agent Order of Action Decisions-MAT (AOAD-MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing it to learn and predict the optimal order of agent actions. AOAD-MAT leverages a Transformer-based actor-critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture with a subtask that predicts the next agent to act; this subtask is integrated into a Proximal Policy Optimization (PPO)-based loss function to maximize the advantage of sequential decision-making. The proposed method was validated through extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) and Multi-Agent MuJoCo (MAMuJoCo) benchmarks. The experimental results show that AOAD-MAT outperforms the existing MAT and other baseline models, demonstrating the effectiveness of adjusting the order of agents' action decisions in MARL.
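To make the idea of integrating a next-agent prediction subtask into a PPO-based loss more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the combined objective is the standard clipped PPO surrogate plus a weighted cross-entropy term over which agent acts next. The function names, the weight `order_coef`, the clipping value `clip_eps`, and the way target labels in `next_agent_ids` are obtained are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate, returned as a loss to minimize."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def next_agent_ce_loss(order_logits, next_agent_ids):
    """Cross-entropy for predicting the index of the next agent to act.

    order_logits: array of shape (batch, n_agents), one score per agent.
    next_agent_ids: integer array of shape (batch,), assumed target labels.
    """
    # Numerically stable log-softmax over the agent dimension.
    shifted = order_logits - order_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(next_agent_ids)), next_agent_ids]
    return -np.mean(picked)


def combined_loss(new_logp, old_logp, advantages, order_logits,
                  next_agent_ids, order_coef=0.1, clip_eps=0.2):
    """Hypothetical joint objective: PPO loss plus weighted order-prediction loss."""
    policy_term = ppo_clip_loss(new_logp, old_logp, advantages, clip_eps)
    order_term = next_agent_ce_loss(order_logits, next_agent_ids)
    return policy_term + order_coef * order_term
```

In this sketch the auxiliary term only shapes the order-prediction head. How the predicted order feeds back into the sequence in which agents select actions depends on details of the architecture that are not given here.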