🤖 AI Summary
This work addresses two challenges in partially observable multi-agent reinforcement learning: deciding when and with whom to communicate, and evaluating the impact of a single message on long-term rewards. The authors propose a utility-guided, temporally grouped communication mechanism that resamples soft agent groups every K steps via Gumbel-Softmax to select communication partners, paired with a group-aware critic that reduces policy-gradient variance. The policy network employs a three-headed architecture (environment action, send decision, and recipient selection) and leverages counterfactual communication advantages for precise credit assignment. Implemented within the centralized training with decentralized execution (CTDE) framework, the approach improves coordination efficiency, reduces communication complexity and gradient variance, and remains fully decentralized at test time.
📝 Abstract
Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning \emph{when} and \emph{with whom} to communicate requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce \textbf{SCoUT} (\textbf{S}calable \textbf{Co}mmunication via \textbf{U}tility-guided \textbf{T}emporal grouping), which addresses both challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples \textit{soft} agent groups every \(K\) environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group-aware critic predicts values for each agent group and maps them to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos, and code: \href{https://scout-comm.github.io/}{https://scout-comm.github.io/}
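Two mechanisms from the abstract lend themselves to a short sketch: sampling soft group assignments with Gumbel-Softmax and turning them into a recipient affinity, and analytically removing one sender's contribution from a recipient's aggregated inbox for the counterfactual advantage. The sketch below is a minimal illustration of those two ideas only; the variable names, the weighted-mean aggregator, and the use of the assignment overlap `A @ A.T` as the affinity are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable soft categorical sample via the Gumbel-Softmax trick."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))         # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

N, G, D = 5, 2, 4                    # agents, latent groups, message dim (illustrative)
logits = rng.normal(size=(N, G))     # per-agent group logits (hypothetical network output)
A = gumbel_softmax(logits, tau=0.5)  # soft assignments; each row sums to 1

# Affinity between agents i and j as overlap of soft group memberships,
# usable as a differentiable prior over recipients (an assumed choice here).
affinity = A @ A.T

# Assumed aggregator: each recipient takes an affinity-weighted mean of
# incoming messages, excluding its own.
messages = rng.normal(size=(N, D))
w = affinity.copy()
np.fill_diagonal(w, 0.0)
agg = w @ messages / w.sum(axis=1, keepdims=True)

def without_sender(j, i):
    """Recipient j's aggregate with sender i's message analytically removed,
    by zeroing i's weight and renormalizing (no re-running the environment)."""
    wj = w[j].copy()
    wj[i] = 0.0
    return wj @ messages / wj.sum()
```

With a linear aggregator like this mean, the counterfactual inbox is a closed-form reweighting, so the sender's marginal effect (and hence a communication advantage) can be computed without extra rollouts.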