AI Summary
In symmetric, homogeneous multi-agent reinforcement learning settings, a shared deterministic policy often fails to induce role differentiation and effective coordination. This work proposes the Diamond Attention architecture, which integrates structured randomness into the attention mechanism: each agent draws a transient scalar random number that determines a rank ordering, and lower-ranked peers are masked from agent-to-agent attention. This yields a one-round, random-bit communication protocol for coordination. Because the attention is set-based, a trained policy can be deployed zero-shot to teams of different sizes without retraining. Experiments support the approach: it achieves a success rate of 1.0 in the XOR game (versus ~0.5 for deterministic baselines), generalizes zero-shot from N=4 to N∈[2,8] in control tasks, and transfers zero-shot across SMACLite scenarios where standard baselines cannot transfer.
Abstract
Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0\% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness-137A/
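The rank-induced masking described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rank_mask` and `masked_agent_attention` are hypothetical, and it assumes that a lower random draw means a lower rank, that each agent always attends to itself, and that ties have probability zero. Task attention, which the abstract says stays unmasked, is not shown.

```python
import numpy as np

def rank_mask(draws: np.ndarray) -> np.ndarray:
    """Return M where M[i, j] is True if agent i may attend to agent j.

    Assumption: agent i sees itself and every peer whose draw is at least
    as large as its own; peers with smaller draws (lower rank) are masked.
    """
    return draws[None, :] >= draws[:, None]

def masked_agent_attention(q, k, draws):
    """Scaled dot-product attention weights over agents under the rank mask."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    mask = rank_mask(draws)
    logits = np.where(mask, logits, -np.inf)      # masked peers get zero weight
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, mask

rng = np.random.default_rng(0)
n_agents, dim = 4, 8

# Permutation-symmetric case: every agent has the same query and key.
x = np.tile(rng.normal(size=(1, dim)), (n_agents, 1))

# One scalar draw per agent; in the described method this is resampled
# at every timestep, so the ordering is transient.
draws = rng.random(n_agents)

weights, mask = masked_agent_attention(x, x, draws)

# Number of peers (including itself) each agent can see.
visible = mask.sum(axis=1)
print(visible)
```

Even though every agent starts from identical inputs, each one sees a different number of peers (from 1 for the top-ranked agent to N for the bottom-ranked one), which is the kind of asymmetry a shared policy can use to assign roles. The mask is built from pairwise comparisons only, so the same code applies to any team size.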