AI Summary
This work addresses the challenge of coordinating large populations of grid-edge devices with fully decentralized deployment while respecting the physics of three-phase AC distribution networks. The authors propose gradient-based multi-agent proximal learning (GradMAP), in which each agent has its own neural-network policy and makes online decisions from local observations only, without inter-agent communication or parameter sharing. During offline training, a differentiable three-phase AC power-flow model is embedded in a primal-dual learning loop, and implicit differentiation propagates exact network-constraint violations back to the policy parameters. To accelerate training, a proximal surrogate with a trust region defined in action space reuses expensive environment gradients, rather than working in the probability-distribution space used by methods such as PPO. On the IEEE 123-bus feeder with 1,000 agents, the method learns decentralized policies with low constraint violations within 15 minutes on a single GPU, trains 3-5x faster than gradient-based self-supervised learning benchmarks, and achieves among the lowest operating costs and constraint violations in out-of-sample tests.
Abstract
Coordinating large populations of grid-edge devices requires learning methods that remain fully decentralised in deployment while still respecting three-phase AC distribution-network physics. This paper proposes gradient-based multi-agent proximal learning (GradMAP) to address this challenge. GradMAP trains independent neural-network policies for each agent without any parameter sharing, and each agent uses only its own local observation for online decision-making without communication. During offline training, GradMAP embeds a differentiable three-phase AC power-flow model in a primal-dual learning loop and uses implicit differentiation to propagate exact network-constraint violations to update the policy parameters. To speed up training, GradMAP reuses expensive environment gradients through a proximal surrogate within a trust region defined in the more direct policy-output (action) space, instead of the probability distribution space used in other works, such as PPO. In case studies with 1,000 agents managing batteries, heat pumps, and controllable generators on the IEEE 123-bus feeder, GradMAP learns decentralised policies that minimise three-phase AC load-flow constraint violations within 15 minutes of training on a single workstation-class NVIDIA RTX PRO 5000 Blackwell 48GB GPU. This is a 3-5x training speed-up over gradient-based self-supervised learning benchmarks and substantially better training efficiency than multi-agent reinforcement-learning benchmarks. In out-of-sample tests, GradMAP also delivers among the lowest operating cost and constraint violations.
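The abstract does not spell out the surrogate, so the following is only a minimal sketch of how an action-space proximal step might reuse one environment gradient across many cheap policy updates. Everything in it is an assumption made for illustration: the quadratic toy cost stands in for the differentiable power-flow model, the policy is a single linear map rather than per-agent neural networks, the surrogate form `g^T (a - a_old) + ||a - a_old||^2 / (2 * eta)` is one plausible reading of "a proximal surrogate within a trust region in action space", and the names `env_cost_and_grad`, `proximal_update`, `eta`, `lr`, and `inner_steps` are hypothetical. Dual updates for constraints are omitted.

```python
import numpy as np


def env_cost_and_grad(actions, target):
    """Toy stand-in (assumption) for an expensive differentiable environment.

    Returns the mean cost over the batch and the per-sample gradient of the
    cost with respect to the actions.
    """
    diff = actions - target
    return 0.5 * np.sum(diff ** 2, axis=1).mean(), diff


def proximal_update(W, obs, grad_a, a_old, eta=0.5, lr=0.1, inner_steps=20):
    """Run several cheap policy updates that reuse one environment gradient.

    Assumed surrogate per sample:
        grad_a^T (a - a_old) + ||a - a_old||^2 / (2 * eta)
    where a = W @ o is the current policy output. The quadratic term keeps
    the new actions close to a_old, where grad_a was evaluated.
    """
    n = obs.shape[0]
    for _ in range(inner_steps):
        a_new = obs @ W.T
        # Gradient of the surrogate with respect to the actions.
        d_surr = grad_a + (a_new - a_old) / eta
        # Chain rule through the linear policy, averaged over the batch.
        W = W - lr * (d_surr.T @ obs) / n
    return W


rng = np.random.default_rng(0)
obs = rng.normal(size=(64, 3))            # local observations (toy data)
target = obs @ rng.normal(size=(2, 3)).T  # actions that minimise the toy cost
W = np.zeros((2, 3))                      # linear policy parameters

costs = []
for _ in range(10):                       # 10 environment-gradient evaluations
    a_old = obs @ W.T
    cost, grad_a = env_cost_and_grad(a_old, target)
    costs.append(cost)
    W = proximal_update(W, obs, grad_a, a_old)  # 20 policy steps per evaluation

final_cost, _ = env_cost_and_grad(obs @ W.T, target)
```

In this sketch the environment is queried 10 times while the policy is updated 200 times, which is the kind of gradient reuse the abstract describes. The trust region lives directly on the policy outputs, so no action probability distribution or likelihood ratio is needed, unlike PPO.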