🤖 AI Summary
Neural-network-based variational Monte Carlo (NNVMC) methods face significant challenges in efficient GPU deployment due to high computational and memory overheads. This work presents the first systematic workload-level analysis of NNVMC’s heterogeneous computing characteristics on GPUs. Employing a unified performance evaluation protocol that integrates GPU hardware performance counters, Roofline modeling, and arithmetic intensity analysis, we conduct an end-to-end empirical assessment of four representative models: PauliNet, FermiNet, Psiformer, and Orbformer. Our study reveals substantial compute-memory imbalance across different execution phases, with performance bottlenecks primarily stemming from low-arithmetic-intensity element-wise operations and frequent data movement. These findings provide critical insights for phase-aware scheduling, memory-centric optimizations, and hardware-software co-design strategies tailored to NNVMC workloads.
📝 Abstract
Neural Network Variational Monte Carlo (NNVMC) has emerged as a promising paradigm for solving quantum many-body problems by combining variational Monte Carlo with expressive neural-network wave-function ansätze. Although NNVMC can achieve competitive accuracy with favorable asymptotic scaling, practical deployment remains limited by high runtime and memory cost on modern graphics processing units (GPUs). Compared with language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov chain Monte Carlo sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and nontrivial bottlenecks. This paper provides a workload-oriented survey and empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends, as well as kernel-level behavior, through kernel-family breakdown, arithmetic intensity, Roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often constrained by low-intensity element-wise and data-movement kernels, while the compute/memory balance varies substantially across ansätze and stages. Based on these findings, we discuss algorithm-hardware co-design implications for scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.
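The Roofline positioning mentioned above follows the standard Roofline model: a kernel's attainable throughput is bounded by the smaller of peak compute and arithmetic intensity times peak memory bandwidth. A minimal sketch of that calculation follows; the hardware numbers are illustrative assumptions (roughly A100-class), not measurements from this study.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity (AI): floating-point operations per byte of DRAM traffic."""
    return flops / bytes_moved

def roofline_bound(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Attainable performance under the Roofline model:
    min(peak compute, AI * peak memory bandwidth)."""
    return min(peak_flops, ai * peak_bw)

# Illustrative (hypothetical) hardware numbers, not values from the paper:
peak_flops = 19.5e12   # ~19.5 TFLOP/s FP32 on an A100-class GPU
peak_bw = 1.555e12     # ~1.555 TB/s HBM2e bandwidth

# The "ridge point" is the AI above which kernels become compute-bound:
ridge = peak_flops / peak_bw  # ~12.5 FLOP/byte

# A low-intensity element-wise kernel (here ~0.25 FLOP/byte, a typical
# order of magnitude for such kernels) sits far left of the ridge and is
# therefore limited by memory bandwidth, not compute:
ai = arithmetic_intensity(1e9, 4e9)          # 0.25 FLOP/byte
bound = roofline_bound(ai, peak_flops, peak_bw)
print(f"AI={ai:.2f} FLOP/B, bound={bound/1e12:.2f} TFLOP/s (ridge at {ridge:.1f})")
```

This is why element-wise and data-movement kernels dominate end-to-end time even on GPUs with ample compute: their attainable throughput is capped by `ai * peak_bw`, far below the compute peak.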