AI Summary
This work proposes the first vector-length-agnostic (VLA) design paradigm for quantum state-vector simulation on ARM processors supporting variable vector lengths. By integrating VLEN-adaptive memory layouts, load buffering, fine-grained loop control, and gate fusion techniques, the approach enables a single source code to achieve high-performance, portable simulation across diverse ARM platforms. The study introduces a systematic evaluation framework combining PMU events with novel quantitative metrics to assess vectorization efficacy. Experimental results demonstrate speedups of up to 4.5×, 2.5×, and 1.5× on the A64FX, Grace, and Graviton3 processors, respectively, confirming both the high performance and cross-platform portability of the proposed methodology.
Abstract
ARM SVE and RISC-V RVV are emerging vector architectures in high-end processors that support vectorization with flexible vector lengths. In this work, we use quantum state-vector simulation, an important workload for quantum computing, to understand whether high-performance portability can be achieved with a vector-length-agnostic (VLA) design. We propose a VLA design and the optimization techniques critical for achieving high performance, including VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation. We provide an implementation in Google's Qsim and evaluate five quantum circuits of up to 36 qubits on three ARM processors: NVIDIA Grace, AWS Graviton3, and Fujitsu A64FX. By defining new metrics and using PMU events to quantify vectorization activities, we draw generic insights for future VLA designs. Our single-source implementation of VLA quantum simulations achieves speedups of up to 4.5x on A64FX, 2.5x on Grace, and 1.5x on Graviton3.
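As a rough illustration of what "vector-length agnostic" means for a state-vector kernel, the C++ sketch below applies a single-qubit gate in chunks whose width `vl` is only known at run time. It is not the paper's Qsim implementation: the names `apply_gate_vla`, `uniform_superposition`, and `vl` are hypothetical, and the inner lane loop is a scalar stand-in for what SVE code would express with a run-time lane count (such as `svcntw()`) and a tail predicate (such as `svwhilelt_b32`). None of the paper's optimizations (memory layout adjustment, load buffering, gate fusion) are modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<float>;

// Hypothetical helper: applies the 2x2 gate `g` (row-major) to qubit `target`
// of a state vector. `vl` is an assumed stand-in for the hardware lane count
// that real SVE code would query at run time.
void apply_gate_vla(std::vector<Amp>& state, unsigned target,
                    const Amp g[4], std::size_t vl) {
  const std::size_t stride = std::size_t{1} << target;  // gap between paired amplitudes
  const std::size_t pairs = state.size() / 2;           // independent amplitude pairs
  for (std::size_t base = 0; base < pairs; base += vl) {
    // Emulates a tail predicate: only lanes that fall below `pairs` are active.
    const std::size_t active = std::min(vl, pairs - base);
    for (std::size_t lane = 0; lane < active; ++lane) {
      const std::size_t p = base + lane;
      // Insert a zero bit at position `target` to get the index with target = 0.
      const std::size_t i0 = ((p >> target) << (target + 1)) | (p & (stride - 1));
      const std::size_t i1 = i0 | stride;
      const Amp a0 = state[i0];
      const Amp a1 = state[i1];
      state[i0] = g[0] * a0 + g[1] * a1;
      state[i1] = g[2] * a0 + g[3] * a1;
    }
  }
}

// Hypothetical driver: applies a Hadamard gate to every qubit of |0...0>.
std::vector<Amp> uniform_superposition(unsigned n_qubits, std::size_t vl) {
  std::vector<Amp> state(std::size_t{1} << n_qubits, Amp{0.0f, 0.0f});
  state[0] = Amp{1.0f, 0.0f};
  const float s = 1.0f / std::sqrt(2.0f);
  const Amp h[4] = {Amp{s, 0.0f}, Amp{s, 0.0f}, Amp{s, 0.0f}, Amp{-s, 0.0f}};
  for (unsigned q = 0; q < n_qubits; ++q) {
    apply_gate_vla(state, q, h, vl);
  }
  return state;
}
```

Calling `uniform_superposition(3, 4)` and `uniform_superposition(3, 16)` runs the same source with two different lane counts and should give the same amplitudes, which is the single-source property the abstract refers to. The performance differences the paper measures come from how well the real vector hardware is used, which a scalar emulation like this cannot show.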