Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

📅 2026-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of scaling cooperative multi-agent reinforcement learning, where per-agent policy gradient variance grows linearly with the number of agents (Θ(N)) due to cross-agent noise, severely degrading sample efficiency. To mitigate this, the authors introduce Descent-Guided Policy Gradient (DG-PG), which uses a differentiable, system-level analytical model of the domain to generate noise-free, individualized guidance gradients. These guidance gradients decouple each agent's learning signal from the actions of all others, reducing gradient variance to O(1) and making sample complexity independent of the number of agents, while preserving the equilibria of the cooperative game. The result is a substantial gain in sample efficiency: on a heterogeneous cloud scheduling task with up to 200 agents, the method converges within 10 training episodes at every tested scale, whereas baselines such as MAPPO and IPPO fail to converge under identical architectures.
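The core idea can be sketched in a few lines: differentiate a system-level analytical model with respect to each agent's own action to obtain a guidance signal that contains no sampling noise from the other agents. The sketch below is illustrative only and does not reproduce the paper's implementation; the cost function `system_cost` and the surrounding setup are hypothetical placeholders standing in for a real domain model (e.g. a cloud-scheduling load model).

```python
# Hypothetical sketch of the guidance-gradient idea (not the paper's code).
# Assumes a differentiable analytical model of the system, here `system_cost`,
# mapping the joint action of N agents to a scalar efficiency objective.
import torch

N = 5  # number of agents

def system_cost(joint_action: torch.Tensor) -> torch.Tensor:
    # Placeholder differentiable model; a real domain would supply its own
    # analytical cost here (e.g. imbalance across heterogeneous servers).
    target_load = torch.full_like(joint_action, 0.5)
    return ((joint_action - target_load) ** 2).sum()

# Each agent's action; treated as a leaf tensor for brevity instead of a policy output.
actions = torch.rand(N, requires_grad=True)

# Noise-free, individualized guidance: differentiate the analytical model
# w.r.t. each agent's own action. No sampled behaviour of the other agents
# enters this signal, so its variance does not grow with N.
cost = system_cost(actions)
(guidance,) = torch.autograd.grad(cost, actions)  # shape (N,), one gradient per agent

# Agent i would then be steered toward descent on the analytical model using
# -guidance[i], rather than relying on a shared-reward return that mixes the
# randomness of all N agents.
print(guidance)
```

This only captures the intuition behind the decoupled signal; how DG-PG combines it with the policy gradient update and why equilibria of the cooperative game are preserved are established in the paper itself.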

📝 Abstract
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\varepsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\varepsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
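As a back-of-envelope reading of the complexity claims (our sketch of the intuition, not the paper's formal proof), the contrast between the two estimators can be written as follows:

```latex
% Informal sketch of the variance intuition; the paper's formal statement may differ.
\[
  g_i \;=\; \nabla_{\theta_i}\log\pi_{\theta_i}(a_i \mid s)\,R(a_1,\dots,a_N),
  \qquad \operatorname{Var}[g_i] \;=\; \Theta(N),
\]
\[
  \tilde g_i \;=\; \frac{\partial f(a_1,\dots,a_N)}{\partial a_i},
  \qquad \operatorname{Var}[\tilde g_i] \;=\; \mathcal{O}(1),
\]
% where R is the shared reward, whose randomness aggregates all N agents' sampled
% actions, and f is the differentiable analytical model, evaluated rather than
% sampled at the current joint action. With O(1) per-agent variance, an
% epsilon-accurate gradient needs O(1/epsilon) samples instead of O(N/epsilon).
```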
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
cross-agent noise
policy gradient
sample complexity
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Descent-Guided Policy Gradient
multi-agent reinforcement learning
gradient variance reduction
scale-invariant sample complexity
analytical model guidance
Shan Yang
Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore
Yang Liu
Associate Professor, CEE and ISEM at NUS
Urban Mobility Operations and Management · Traffic Congestion Management · Data-Driven Transportation