Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction

πŸ“… 2025-07-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Equivariant graph neural networks (eGNNs) face severe memory bottlenecks in large-scale electronic structure prediction: long-range atomic orbital interactions (>10 Γ…) induce densely connected graphs whose training and inference footprints exceed the memory of a single modern GPU. Method: a distributed eGNN implementation that leverages direct GPU-to-GPU communication, combined with a partitioning strategy for the input graph that reduces the number of embedding exchanges between GPUs, trained on DFT data. Contribution/Results: the implementation demonstrates strong scaling up to 128 GPUs and weak scaling up to 512 GPUs with 87% parallel efficiency on the Alps supercomputer, enabling end-to-end training and inference for structures of 3,000 to 190,000 atoms. This makes high-accuracy, scalable electronic structure modeling of extended defects, interfaces, and disordered phases tractable at previously inaccessible system sizes.
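The partitioning idea in the summary can be illustrated with a toy sketch (this is not the paper's actual code; the function names and the simple slab-partitioning heuristic are illustrative assumptions): atoms that are close in space share most of their interaction edges, so splitting the structure into contiguous spatial slabs keeps edges local and reduces the embeddings that must cross GPU boundaries.

```python
import numpy as np

def partition_atoms(positions, n_parts):
    # Illustrative heuristic, not the paper's method: sort atoms along
    # the axis with the largest spatial extent and split them into
    # contiguous slabs, one slab per GPU.
    axis = int(np.ptp(positions, axis=0).argmax())
    order = np.argsort(positions[:, axis])
    return np.array_split(order, n_parts)

def count_cut_edges(edges, part_id):
    # Edges whose endpoints fall in different partitions: each one
    # forces an embedding exchange between the two owning GPUs.
    return int(np.sum(part_id[edges[:, 0]] != part_id[edges[:, 1]]))

# A 4-atom chain split into 2 slabs; only the middle bond is cut.
positions = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
parts = partition_atoms(positions, 2)          # [0, 1] and [2, 3]
part_id = np.array([0, 0, 1, 1])
edges = np.array([[0, 1], [1, 2], [2, 3]])
print(count_cut_edges(edges, part_id))         # 1
```

A good partitioner minimizes the cut-edge count, since that count is proportional to the communication volume per message-passing layer.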

πŸ“ Abstract
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales, enabling investigation of the electronic properties of materials with extended defects, interfaces, or exhibiting disordered phases. However, as interactions between atomic orbitals typically extend over 10+ angstroms, the graph representations required for this task tend to be densely connected, and the memory requirements to perform training and inference on these large structures can exceed the limits of modern GPUs. Here we present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy of the input graph to reduce the number of embedding exchanges between GPUs. Our implementation shows strong scaling up to 128 GPUs, and weak scaling up to 512 GPUs with 87% parallel efficiency for structures with 3,000 to 190,000 atoms on the Alps supercomputer.
Problem

Research questions and friction points this paper is trying to address.

Predicting electronic structure at large scale with eGNNs
Fitting densely connected atomic orbital graphs within GPU memory limits
Distributing training and inference over large structures across GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed eGNN implementation for large-scale prediction
Direct GPU communication for efficient embedding exchanges
Input-graph partitioning that reduces inter-GPU exchanges and improves parallel efficiency
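The interplay of the last two points can be sketched on a single host (a simplified simulation under assumed names, not the paper's implementation): each rank owns a subset of nodes, must receive "halo" embeddings for off-rank neighbors before aggregating, and the size of that halo is exactly what graph partitioning tries to shrink.

```python
import numpy as np

def halo_indices(edges, owner, rank):
    # Nodes owned by other ranks whose embeddings this rank must
    # receive before aggregating messages over its local edges.
    local_edges = edges[owner[edges[:, 1]] == rank]
    sources = local_edges[:, 0]
    return np.unique(sources[owner[sources] != rank])

def aggregate(embeddings, edges, owner, n_ranks):
    # One sum-aggregation step, simulated serially: each rank handles
    # the edges whose destination node it owns. On real hardware the
    # halo embeddings would arrive via GPU-direct communication.
    out = np.zeros_like(embeddings)
    for rank in range(n_ranks):
        mask = owner[edges[:, 1]] == rank
        src, dst = edges[mask, 0], edges[mask, 1]
        np.add.at(out, dst, embeddings[src])
    return out

owner = np.array([0, 0, 1, 1])                      # node -> rank
edges = np.array([[0, 1], [1, 2], [2, 3], [0, 3]])  # directed src -> dst
print(halo_indices(edges, owner, 1))                # [0 1]: rank 1's halo
```

Fewer cut edges mean a smaller halo per rank, which is why the partitioning strategy directly translates into less communication and better parallel efficiency.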