LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low efficiency of sparse-dense matrix multiplication (SpMM) on Arm’s Scalable Matrix Extension (SME) architecture, this paper proposes LOOPS—a hybrid execution framework. LOOPS jointly schedules SME’s scalable matrix engine and NEON SIMD units, integrates a hybrid storage format combining row-wise CSR with vector-level blocked CSR (BCSR), and employs a lightweight performance model to enable two-level adaptive parallelization across both SME and SIMD resources. It supports FP64, FP32, and FP16 precisions. Evaluated on an Apple M4 Pro CPU, LOOPS achieves average speedups of 9.93× (FP32) and 14.4× (FP64) over TACO, and up to 71.3× over Armadillo; it also significantly outperforms the NVIDIA A100 GPU in energy efficiency. The core contributions are: (1) a novel SME-SIMD co-scheduling mechanism; (2) a hybrid sparse layout design optimized for SME’s microarchitecture; and (3) an adaptive SpMM optimization paradigm specifically tailored for Arm SME.
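The hybrid storage idea can be illustrated with a toy row-partitioning heuristic: rows whose nonzeros cluster into mostly-full column blocks are routed to a blocked (BCSR) part suited to SME tile operations, while rows with scattered nonzeros stay in a plain CSR part handled by NEON vector code. This is a minimal sketch under assumed parameters (`block`, `density` are illustrative thresholds, not the ones used by LOOPS, whose selection is guided by its performance model):

```python
def split_rows(dense_rows, block=4, density=0.5):
    """Toy partitioner: assign each nonempty row to a BCSR part
    (block-friendly, for SME tiles) or a CSR part (scattered, for
    NEON). Illustrative only; not the LOOPS heuristic."""
    csr_part, bcsr_part = [], []
    for i, row in enumerate(dense_rows):
        nnz = sum(1 for v in row if v != 0)
        if nnz == 0:
            continue
        # Count nonzeros that fall in column blocks that are mostly full.
        blocked = 0
        for b in range(0, len(row), block):
            chunk = row[b:b + block]
            filled = sum(1 for v in chunk if v != 0)
            if filled >= density * block:
                blocked += filled
        (bcsr_part if blocked >= density * nnz else csr_part).append(i)
    return csr_part, bcsr_part

A = [
    [1, 2, 3, 4, 0, 0, 0, 0],  # one full 4-wide block -> BCSR part
    [5, 0, 0, 0, 0, 0, 6, 0],  # scattered nonzeros    -> CSR part
]
print(split_rows(A))  # ([1], [0])
```

Splitting by row keeps both parts independently schedulable, which is what allows the two instruction sets to be driven in parallel.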

📝 Abstract
Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces the Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines a row-wise CSR part with a vector-wise BCSR part, enabling cooperative utilization of vector instructions (NEON) and Scalable Matrix Extension (SME) resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse collection on an Apple M4 Pro CPU show that LOOPS achieves average speedups of 9.93× (FP32)/14.4× (FP64) against the CPU baseline TACO and 71.3× (FP32)/54.8× (FP64) with respect to Armadillo. A comparison of LOOPS running on the same CPU with two GPU methods (cuSPARSE, Magicube) executed on an NVIDIA A100 GPU shows average speedups for LOOPS between 19.8× and 33.5×, depending on the precision. Notably, LOOPS delivers significantly better energy efficiency than the GPU codes on the A100 GPU.
Problem

Research questions and friction points this paper is trying to address.

Optimizing sparse matrix multiplication on Arm SME architectures
Combining SME and SIMD resources for unstructured sparse workloads
Achieving high performance and energy efficiency in SpMM operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid execution framework combining CSR and BCSR layouts
Adaptive two-level parallelization scheme for multi-precision SpMM
Cooperative utilization of NEON vector and SME matrix instructions
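Regardless of which unit executes a row, both parts must produce the same result, namely C = A × B with A sparse. A plain-Python reference for the CSR-part semantics, purely to pin down what the kernel computes (the LOOPS kernels themselves are NEON/SME code, not reproduced here):

```python
def spmm_csr(indptr, indices, data, B, n_cols):
    """Reference SpMM: C = A @ B with A stored in CSR
    (indptr/indices/data). LOOPS computes the same result but
    dispatches CSR-part rows to NEON and BCSR-part blocks to SME
    tile instructions; this version only fixes the semantics."""
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            a, j = data[p], indices[p]
            for k in range(n_cols):
                C[i][k] += a * B[j][k]
    return C

# A = [[1, 0], [0, 2]] in CSR form, multiplied by a dense B.
C = spmm_csr([0, 1, 2], [0, 1], [1.0, 2.0],
             [[1.0, 2.0], [3.0, 4.0]], 2)
print(C)  # [[1.0, 2.0], [6.0, 8.0]]
```

Because each output row depends only on one sparse row of A, rows can be distributed across cores (the outer parallel level) and across SME/NEON units within a core (the inner level) without synchronization on C.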
Kelun Lei
Beihang University, Beijing, China
Hailong Yang
Beihang University, Beijing, China
Kaige Zhang
Beihang University, Beijing, China
Kejie Ma
Beihang University, Beijing, China
Yiqing Wang
Beihang University, Beijing, China
Xin You
Beihang University
Performance Tools, HPC
Yufan Xu
Uber Technologies
Compilers, HPC, GPU, Program Analysis
E. Quintana-Ortí
Universitat Politècnica de València
Zhongzhi Luan
Beihang University
Yi Liu
Beihang University, Beijing, China
Depei Qian
Beihang University, Beijing, China