AI Summary
Problem: Plane-wave density functional theory (PW-DFT) struggles to scale to ten-thousand-atom ab initio simulations on domestic Sunway supercomputers because each process is limited to only 16 GB of memory.
Method: We propose a full-stack, architecture-aware parallel optimization framework for PW-DFT tailored to the Sunway many-core architecture, integrating MPI+OpenMP hybrid parallelism, sparse fast Fourier transforms (FFT), adaptive k-point sampling, low-rank density matrix compression, and customized many-core vectorization.
Contribution/Results: Our approach achieves, for the first time, a PW-DFT calculation on a 16,384-carbon-atom system under the 16 GB per-process memory limit, setting a new record for system size in plane-wave methods. On a 4,096-silicon-atom benchmark, it delivers a 64.8x speedup over baseline implementations. This work overcomes both the memory and computational bottlenecks of PW-DFT on domestically developed supercomputing platforms and establishes a scalable, high-performance implementation pathway for large-scale materials simulations.
Abstract
First-principles density functional theory (DFT) with a plane-wave (PW) basis set is the most widely used method in quantum mechanical materials simulations owing to its accuracy and universality. However, PW-based DFT calculations carry substantial computational cost and memory usage, which currently limits their ability to simulate large-scale complex systems containing thousands of atoms. The problem is exacerbated on the new Sunway supercomputer, where each process is limited to a mere 16 GB of memory. Herein, we present a novel parallel implementation of plane-wave density functional theory on the new Sunway supercomputer (PWDFT-SW). PWDFT-SW fully exploits the Sunway supercomputer by extensively refactoring and tuning our algorithms to match the characteristics of the Sunway system. Through extensive numerical experiments, we demonstrate that our methods substantially decrease both computational cost and memory usage. Our optimizations yield a speedup of 64.8x for a physical system containing 4,096 silicon atoms, enabling us to push the limit of PW-based DFT calculations to large-scale systems containing 16,384 carbon atoms.
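To give a rough sense of why a 16 GB per-process limit is restrictive, the following Python sketch estimates the memory needed just to store the occupied Kohn-Sham wavefunctions in a plane-wave basis. It is a back-of-the-envelope illustration, not the memory model used by PWDFT-SW. The function names are hypothetical, and the kinetic-energy cutoff (10 Hartree) and the assumption of spin-unpolarized, double-precision complex storage at a single k-point are illustrative choices that do not come from the abstract. The volume per atom is derived from the experimental silicon lattice constant (5.431 Å, 8 atoms per cubic cell).

```python
import math

BYTES_PER_COMPLEX128 = 16      # one double-precision complex coefficient
BOHR_PER_ANGSTROM = 1.8897259886


def plane_wave_count(volume_bohr3: float, ecut_hartree: float) -> float:
    """Approximate number of plane waves G with |G|^2 / 2 <= ecut.

    Counts reciprocal-lattice points inside a sphere of radius
    g_max = sqrt(2 * ecut), using the point density V / (2*pi)^3.
    """
    g_max = math.sqrt(2.0 * ecut_hartree)
    return volume_bohr3 * g_max**3 / (6.0 * math.pi**2)


def occupied_bands(n_atoms: int, valence_per_atom: int) -> int:
    """Occupied bands for a spin-unpolarized system (2 electrons per band)."""
    return n_atoms * valence_per_atom // 2


def wavefunction_bytes(n_atoms: int, valence_per_atom: int,
                       volume_per_atom_bohr3: float,
                       ecut_hartree: float) -> float:
    """Total bytes for all occupied wavefunction coefficients (one k-point)."""
    n_bands = occupied_bands(n_atoms, valence_per_atom)
    n_pw = plane_wave_count(n_atoms * volume_per_atom_bohr3, ecut_hartree)
    return n_bands * n_pw * BYTES_PER_COMPLEX128


# Silicon: cubic cell of 5.431 Angstrom holding 8 atoms.
SI_VOLUME_PER_ATOM = (5.431 * BOHR_PER_ANGSTROM) ** 3 / 8

# ASSUMPTION: a 10 Hartree cutoff, chosen only for illustration.
ASSUMED_ECUT = 10.0

if __name__ == "__main__":
    total = wavefunction_bytes(4096, 4, SI_VOLUME_PER_ATOM, ASSUMED_ECUT)
    print(f"4,096 Si atoms: {total / 2**30:.1f} GiB of wavefunction storage")
    print(f"Minimum processes at 16 GiB each: {math.ceil(total / (16 * 2**30))}")
```

Under these assumed inputs, the wavefunctions of the 4,096-silicon-atom system alone occupy on the order of 10^11 bytes, several times the 16 GB available to one process, before counting FFT work arrays, the density, or intermediate matrices. Both the band count and the plane-wave count grow linearly with the number of atoms, so this storage grows quadratically with system size. That scaling is why memory reduction matters as much as raw speed when pushing toward ten-thousand-atom systems.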