🤖 AI Summary
To address the low efficiency of sparse matrix-dense matrix multiplication (SpMM) on Arm's Scalable Matrix Extension (SME) architecture, this paper proposes LOOPS, a hybrid execution framework. LOOPS jointly schedules the SME matrix engine and NEON SIMD units, integrates a hybrid storage format combining row-wise CSR with vector-level blocked CSR (BCSR), and employs a lightweight performance model to enable two-level adaptive parallelization across both SME and SIMD resources. It supports FP64, FP32, and FP16 precisions. Evaluated on an Apple M4 Pro CPU, LOOPS achieves average speedups of 9.93× (FP32) and 14.4× (FP64) over TACO, and up to 71.3× over Armadillo; it also delivers significantly better energy efficiency than an NVIDIA A100 GPU. The core contributions are: (1) a novel SME-SIMD co-scheduling mechanism; (2) a hybrid sparse layout design optimized for SME's microarchitecture; and (3) an adaptive SpMM optimization paradigm tailored to Arm SME.
📝 Abstract
Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces the Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines a row-wise CSR part with a vector-wise BCSR part layout, enabling cooperative utilization of vector (NEON) and Scalable Matrix Extension (SME) resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse collection on an Apple M4 Pro CPU show that LOOPS achieves average speedups of 9.93$\times$ (FP32) and 14.4$\times$ (FP64) against the CPU baseline TACO, and 71.3$\times$ (FP32) and 54.8$\times$ (FP64) with respect to Armadillo. A comparison of LOOPS running on the same CPU with two GPU methods (cuSPARSE, Magicube) executed on an NVIDIA A100 GPU shows average speedups for LOOPS between 19.8$\times$ and 33.5$\times$, depending on the precision. Notably, LOOPS delivers significantly better energy efficiency than the GPU codes on the A100 GPU.
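The abstract describes a hybrid layout that keeps some rows of the sparse matrix in a CSR part and routes others through a blocked (BCSR-style) part, dispatching each group to a different execution resource. A minimal NumPy sketch of that split-and-dispatch idea is below; the row-selection heuristic, threshold, and function names are illustrative assumptions of ours, standing in for the paper's lightweight performance model and actual SME/NEON kernels:

```python
import numpy as np

def split_hybrid(A, thresh=3):
    """Partition rows of A into two groups: denser rows for the
    block/tile path (BCSR-like, mapped to SME in the paper) and
    sparser rows for the gather path (CSR-like, mapped to NEON).
    The per-row nonzero-count test is a crude stand-in heuristic."""
    bcsr_rows, csr_rows = [], []
    for i in range(A.shape[0]):
        if np.count_nonzero(A[i]) >= thresh:
            bcsr_rows.append(i)
        else:
            csr_rows.append(i)
    return bcsr_rows, csr_rows

def spmm_hybrid(A, B, bcsr_rows, csr_rows):
    """Compute C = A @ B, routing each row group through its own path."""
    C = np.zeros((A.shape[0], B.shape[1]))
    # tile-friendly path: full-row multiply for the denser rows
    for i in bcsr_rows:
        C[i] = A[i] @ B
    # gather path: touch only the stored nonzeros of sparse rows
    for i in csr_rows:
        cols = np.nonzero(A[i])[0]
        C[i] = A[i, cols] @ B[cols]
    return C
```

Both paths produce the same result as a dense multiply; the point of the split is that each row group is shaped to suit a different hardware engine:

```python
rng = np.random.default_rng(0)
A = rng.random((8, 6)) * (rng.random((8, 6)) < 0.3)  # ~30% dense
B = rng.random((6, 4))
br, cr = split_hybrid(A)
assert np.allclose(spmm_hybrid(A, B, br, cr), A @ B)
```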