🤖 AI Summary
This work addresses the scalability limits of the neural-network quantum state method with selected configuration interaction (NNQS-SCI), whose hybrid CPU–GPU architecture causes communication bottlenecks and high computational overhead. To overcome these limits, the authors propose cuNNQS-SCI, the first fully GPU-accelerated NNQS-SCI framework, which shifts the bottleneck from host-side processing back to on-device inference. Key innovations include distributed, load-balanced global de-duplication, fine-grained CUDA kernels for coupled-configuration generation, GPU memory pooling, and streaming mini-batch processing with overlapped offloading. Evaluated on a 64-GPU NVIDIA A100 cluster, the framework achieves up to 2.32× end-to-end speedup over the optimized NNQS-SCI baseline and maintains over 90% parallel efficiency in strong scaling, while preserving the same chemical accuracy.
📝 Abstract
AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by its hybrid CPU–GPU architecture. Specifically, centralized CPU-based global de-duplication creates a scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation incurs prohibitive computational overhead. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. First, cuNNQS-SCI integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. Second, to address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32× end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong-scaling tests.
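
The abstract does not spell out how the distributed de-duplication works. As a rough illustration of the general idea, the sketch below uses hash partitioning: every configuration (an occupation bitstring stored as an integer) is assigned to an owner rank by a hash, so duplicates generated on different ranks always meet at the same owner and can be removed there without a central CPU pass. This is an assumption for illustration, not the cuNNQS-SCI algorithm; the names `owner_rank` and `distributed_dedup` are hypothetical, the hash constant is an arbitrary choice, and the all-to-all exchange is simulated in a single process rather than on GPUs.

```python
def owner_rank(config: int, n_ranks: int) -> int:
    """Map a configuration (occupation bitstring as an int) to its owner rank."""
    # Multiplicative hashing spreads nearby bitstrings across ranks, which is
    # what keeps the per-rank load roughly even. The constant and the 64-bit
    # mask are illustrative choices, not taken from the paper.
    h = (config * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return (h >> 32) % n_ranks


def distributed_dedup(local_configs: list[list[int]]) -> list[list[int]]:
    """Simulate hash-partitioned global de-duplication.

    local_configs[r] holds the configurations generated on rank r.
    Returns, for each rank, the sorted unique configurations it owns.
    """
    n_ranks = len(local_configs)
    inbox: list[list[int]] = [[] for _ in range(n_ranks)]

    for configs in local_configs:
        # Remove local duplicates first so less data is exchanged.
        for c in set(configs):
            # Simulated all-to-all: send each configuration to its owner.
            inbox[owner_rank(c, n_ranks)].append(c)

    # Each owner removes duplicates among what it received. Because ownership
    # is a pure function of the configuration, the union is globally unique.
    return [sorted(set(received)) for received in inbox]


if __name__ == "__main__":
    per_rank = [
        [0b0011, 0b0101, 0b0011],  # rank 0, with a local duplicate
        [0b0101, 0b1001],          # rank 1, shares 0b0101 with rank 0
        [0b0011, 0b1100],          # rank 2, shares 0b0011 with rank 0
    ]
    for rank, owned in enumerate(distributed_dedup(per_rank)):
        print(rank, [format(c, "04b") for c in owned])
```

In a real multi-GPU setting the simulated inbox would be replaced by a collective exchange, and the local and owner-side duplicate removal would run as on-device sort-and-unique or hash-set operations; those details are outside what the abstract states.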