cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

📅 2026-04-17

🤖 AI Summary
This work addresses the scalability limitations of conventional neural-network quantum states with selected configuration interaction (NNQS-SCI), which suffer from communication bottlenecks and high computational overhead due to CPU–GPU hybrid architectures. To overcome these challenges, the authors propose the first fully GPU-accelerated NNQS-SCI framework, shifting the computational bottleneck entirely onto the device side. Key innovations include distributed load-balanced duplicate elimination, fine-grained CUDA kernels, GPU memory pooling, and streaming mini-batch processing. Evaluated on a 64-GPU A100 cluster, the framework achieves up to 2.32× end-to-end speedup and demonstrates strong-scaling parallel efficiency exceeding 90%, all while preserving chemical accuracy.
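The efficiency figure above can be read with the conventional strong-scaling definition: for a fixed problem size, efficiency is the measured speedup over a reference run divided by the ideal speedup from the added GPUs. The Python sketch below illustrates that arithmetic only. The function name and the timings in the usage lines are illustrative assumptions, not values from the paper, and the summary does not state which reference GPU count the authors used.

```python
def strong_scaling_efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Parallel efficiency for a fixed problem size.

    t_ref: wall time of the reference run on n_ref GPUs
    t_n:   wall time of the scaled run on n GPUs
    Returns measured speedup divided by ideal speedup (n / n_ref).
    """
    if t_ref <= 0 or t_n <= 0 or n_ref <= 0 or n <= 0:
        raise ValueError("times and GPU counts must be positive")
    measured_speedup = t_ref / t_n
    ideal_speedup = n / n_ref
    return measured_speedup / ideal_speedup


# Hypothetical timings (NOT from the paper): 100 s on 8 GPUs, 13.5 s on 64 GPUs.
eff = strong_scaling_efficiency(t_ref=100.0, n_ref=8, t_n=13.5, n=64)
print(f"{eff:.1%}")  # prints 92.6%
```

Under this definition, an efficiency above 90% means the run on the larger GPU count loses less than a tenth of the ideal speedup to communication and load imbalance.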

📝 Abstract
AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by a hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a severe scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation induces prohibitive computational overheads. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. cuNNQS-SCI first integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. To address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32× end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong-scaling tests.
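One generic way to make global de-duplication both distributed and load-balanced is hash partitioning: every configuration is assigned an owner rank by a well-mixed hash, each rank sends its candidates to their owners, and each owner removes duplicates among what it receives. The abstract does not describe the authors' algorithm, so the sketch below shows this general pattern and is not the cuNNQS-SCI implementation. It simulates the ranks in a single Python process, assumes configurations are encoded as integer occupation bitstrings, and uses hypothetical names (`mix64`, `owner`, `distributed_dedup`).

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def mix64(x: int) -> int:
    """SplitMix64-style bit mixer so that similar bitstrings land on different owners."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def owner(config: int, num_ranks: int) -> int:
    """Rank responsible for this configuration (assumed integer-encoded)."""
    return mix64(config) % num_ranks


def distributed_dedup(per_rank_configs):
    """Return one sorted, duplicate-free list per rank; lists are disjoint across ranks."""
    num_ranks = len(per_rank_configs)

    # Step 1: each rank removes its local duplicates and buckets the rest by owner.
    send = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
    for src, configs in enumerate(per_rank_configs):
        for c in set(configs):
            send[src][owner(c, num_ranks)].append(c)

    # Step 2: simulated all-to-all exchange (a collective in a real multi-GPU run).
    # Step 3: each owner removes duplicates that arrived from different ranks.
    result = []
    for dst in range(num_ranks):
        received = [c for src in range(num_ranks) for c in send[src][dst]]
        result.append(sorted(set(received)))
    return result


# Three simulated ranks with overlapping candidate configurations.
parts = distributed_dedup([[0b0011, 0b0101, 0b0011],
                           [0b0101, 0b1001],
                           [0b1100, 0b0011]])
```

Because identical configurations always hash to the same owner, no central process ever has to see the full candidate set, and a well-mixed hash keeps the per-rank share roughly even. Whether the paper balances load this way or by another scheme is not stated in the abstract.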
Problem

Research questions and friction points this paper is trying to address.

neural network quantum states
selected configuration interaction
GPU acceleration
scalability bottleneck
many-body systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU acceleration
neural network quantum states
selected configuration interaction
distributed deduplication
CUDA kernels
Authors
Daran Sun
Institute of Computing Technology, Chinese Academy of Sciences
Bowen Kan
Institute of Computing Technology, Chinese Academy of Sciences
Haoquan Long
Institute of Computing Technology, Chinese Academy of Sciences
Hairui Zhao
Institute of Computing Technology, Chinese Academy of Sciences
Haoxu Li
Institute of Computing Technology, Chinese Academy of Sciences
Yicheng Liu
Tsinghua University
Robotics
Pengyu Zhou
Institute of Computing Technology, Chinese Academy of Sciences
Ankang Feng
University of Science and Technology of China
Wenjing Huang
RAND Corporation
Psychometrics, Structural Equation Modeling, Item Response Theory, Cyber Security
Yida Gu
Institute of Computing Technology, Chinese Academy of Sciences
Zhenyu Li
University of Science and Technology of China
Electronic structure calculations, Molecular simulation, Materials science
Honghui Shang
University of Science and Technology of China
Condensed-matter theory, Quantum Chemistry
Yunquan Zhang
Professor of Institute of Computing Technology, CAS
parallel computing, parallel programming, parallel computational model
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member
High Performance Computing, Data Reduction, Deep Learning, Systems for ML, GPU
Ninghui Sun
Institute of Computing Technology, Chinese Academy of Sciences
Guangming Tan
Institute of Computing Technology, Chinese Academy of Sciences