🤖 AI Summary
Existing value decomposition methods ignore the connectivity constraints of realistic wireless communication, so credit assignment cannot reflect which agents actually exchanged information. To address this, the authors propose CLOVER, a framework that feeds the communication graph realized under a stochastic wireless channel into the centralized value mixer as a relational inductive bias. The mixer is a graph neural network whose node-specific weights come from a permutation-equivariant hypernetwork, which makes credit assignment aware of the communication topology. The authors prove that the mixer is permutation invariant and monotonic and that it is strictly more expressive than QMIX-style mixers; the framework is trainable end to end. On the Predator-Prey and Lumberjacks benchmarks, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX, and its agents learn adaptive signaling and listening strategies.
📝 Abstract
Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
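To make the mixer description concrete, the following is a minimal pure-Python sketch of a graph-conditioned monotonic mixer. It is not the authors' implementation: the function `mix`, the linear "hypernetwork" vectors `w_hyper` and `b_hyper`, the mean aggregation over received messages, the ELU activation, and the sum readout are all assumptions chosen for illustration. It shows only the two structural properties named in the abstract. Because the same hypernetwork is applied at every node and the readout is a sum, relabeling the agents leaves the output unchanged. Because the mixing weights pass through `abs()` and the activation is increasing, the output never decreases when any individual utility increases, which is the standard sufficient condition for IGM used by QMIX-style mixers.

```python
import math


def elu(x):
    """Strictly increasing activation, so it preserves monotonicity."""
    return x if x > 0 else math.exp(x) - 1.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix(q, adj, feats, layers):
    """Illustrative graph-conditioned monotonic mixer (hypothetical sketch).

    q:      list of n per-agent utilities
    adj:    n x n 0/1 matrix; adj[i][j] = 1 if agent i received agent j's message
    feats:  list of n per-agent feature vectors fed to the hypernetwork
    layers: list of (w_hyper, b_hyper) vectors, shared by all nodes
    """
    n = len(q)
    h = list(q)
    for w_hyper, b_hyper in layers:
        # Shared hypernetwork -> node-specific weights; abs() keeps them >= 0.
        w = [abs(dot(f, w_hyper)) for f in feats]
        # Node-specific biases depend on features only, never on utilities.
        b = [dot(f, b_hyper) for f in feats]
        new_h = []
        for i in range(n):
            # Each node aggregates itself plus the agents it heard from.
            nbrs = [j for j in range(n) if j == i or adj[i][j]]
            agg = sum(w[j] * h[j] for j in nbrs) / len(nbrs)
            new_h.append(elu(agg + b[i]))
        h = new_h  # stacking layers gives multi-hop propagation
    return sum(h)  # sum readout: permutation invariant


if __name__ == "__main__":
    q = [1.0, -0.5, 2.0]
    feats = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]
    layers = [([0.5, -0.4], [0.1, 0.2]), ([1.0, 0.3], [-0.2, 0.1])]
    ring = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
    silent = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    # Same utilities, different realized topologies.
    print(mix(q, ring, feats, layers), mix(q, silent, feats, layers))
```

In this sketch the communication graph enters only through which neighbors each node aggregates, so it is the realized topology, and not the utilities alone, that determines how credit is mixed. A trainable version would replace the fixed hypernetwork vectors with learned parameters.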