Demystifying ARM SME to Optimize General Matrix Multiplications

📅 2025-12-24
🤖 AI Summary
Existing ARM linear algebra libraries fail to fully exploit the Scalable Matrix Extension (SME), suffering severe performance bottlenecks in large-scale General Matrix Multiplication (GEMM). To address this, we propose MpGEMM, the first open-source, high-performance GEMM library designed specifically for ARM SME. We conduct the first systematic microarchitectural characterization of SME and derive three key optimization principles: cache-aware blocking, on-the-fly transpose-and-pack, and fully register-resident tile-based microkernels. Leveraging multi-vector load instructions and fine-grained tile-register scheduling, MpGEMM achieves a 1.23× speedup over Apple's Accelerate framework on the Apple M4 Pro and outperforms leading open-source libraries. We further validate its effectiveness on realistic large-model workloads, including DeepSeek and LLaMA, demonstrating substantial end-to-end inference acceleration.
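The first of the three principles, cache-aware blocking, can be illustrated with a minimal portable-C sketch. The block sizes `MC`/`NC`/`KC` and the scalar inner loop are hypothetical placeholders for illustration, not MpGEMM's actual parameters or its SME tile code:

```c
#include <stddef.h>

/* Hypothetical cache-level block sizes; MpGEMM derives its own from the
 * measured SME/M4 cache hierarchy. */
#define MC 64
#define NC 64
#define KC 64

/* C[m x n] += A[m x k] * B[k x n], all row-major.
 * Demonstrates cache-aware blocking only: each (MC x KC) block of A and
 * (KC x NC) block of B is reused while it is still cache-resident. */
static void gemm_blocked(size_t m, size_t n, size_t k,
                         const float *A, const float *B, float *C)
{
    for (size_t ic = 0; ic < m; ic += MC)
        for (size_t pc = 0; pc < k; pc += KC)
            for (size_t jc = 0; jc < n; jc += NC) {
                size_t mb = ic + MC < m ? MC : m - ic;
                size_t kb = pc + KC < k ? KC : k - pc;
                size_t nb = jc + NC < n ? NC : n - jc;
                for (size_t i = 0; i < mb; i++)
                    for (size_t p = 0; p < kb; p++) {
                        float a = A[(ic + i) * k + (pc + p)];
                        for (size_t j = 0; j < nb; j++)
                            C[(ic + i) * n + (jc + j)] +=
                                a * B[(pc + p) * n + (jc + j)];
                    }
            }
}
```

In a real SME library the innermost loops would be replaced by a tile microkernel; the blocking structure around them is what keeps the operands cache-resident.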

📝 Abstract
General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures like ARM's Scalable Matrix Extension (SME) introduce dedicated hardware for matrix operations, existing linear algebra libraries fail to fully exploit its potential, particularly for large matrices. This paper presents MpGEMM, an open-source library that leverages key architectural features of SME to optimize GEMM across multiple precisions. Through a systematic characterization of SME, we derive optimization guidelines that inform our design. MpGEMM employs cache-aware partitioning, efficient data packing with on-the-fly transposition, and specialized micro-kernels that utilize multi-vector loads and all available tile registers. Evaluated on an Apple M4 Pro with real-world workloads from DeepSeek and LLaMA, MpGEMM achieves an average speedup of 1.23x over the vendor-optimized Apple Accelerate library and significantly outperforms other open-source alternatives.
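The "micro-kernels that utilize all available tile registers" idea can be sketched in portable C: the accumulator tile lives in locals for the entire reduction loop and is written back to memory only once, analogous to how an SME microkernel keeps accumulators in ZA tile registers across FMOPA outer-product updates. The 4×4 tile shape, panel layout, and names below are assumptions for illustration, not MpGEMM's actual kernel:

```c
#include <stddef.h>

#define MR 4   /* microkernel tile rows (assumed)    */
#define NRK 4  /* microkernel tile columns (assumed) */

/* Register-resident microkernel over packed panels:
 * C[MR x NRK] += Ap * Bp, where Ap/Bp hold kb slices of MR (resp. NRK)
 * contiguous elements. Each k step is a rank-1 update of the whole tile,
 * mirroring an SME FMOPA outer product into a ZA tile. */
static void microkernel_4x4(size_t kb,
                            const float *Ap,   /* MR-wide packed A panel  */
                            const float *Bp,   /* NRK-wide packed B panel */
                            float *C, size_t ldc)
{
    float acc[MR][NRK] = {{0}};    /* stays in registers for the k loop */
    for (size_t p = 0; p < kb; p++) {
        const float *a = Ap + p * MR;
        const float *b = Bp + p * NRK;
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NRK; j++)
                acc[i][j] += a[i] * b[j];   /* rank-1 (outer-product) update */
    }
    for (size_t i = 0; i < MR; i++)         /* single write-back of the tile */
        for (size_t j = 0; j < NRK; j++)
            C[i * ldc + j] += acc[i][j];
}
```

Keeping the accumulator tile register-resident removes all loads and stores of C from the inner loop, which is the property the paper's SME microkernels exploit at tile-register scale.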
Problem

Research questions and friction points this paper is trying to address.

Optimizes GEMM for ARM SME architecture
Exploits SME features for large matrices
Improves performance over existing libraries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages ARM SME architectural features for GEMM optimization
Employs cache-aware partitioning and on-the-fly data packing
Uses specialized micro-kernels with multi-vector loads and tile registers
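The "on-the-fly transposition" point above can be sketched as follows: when an operand is stored transposed, the pack routine gathers it directly into contiguous microkernel panels, folding the transpose into the copy so no separate transpose pass over memory is needed. The `NR` panel width and the packed layout are assumptions for illustration, not MpGEMM's actual format:

```c
#include <stddef.h>

#define NR 4  /* assumed panel width matching the microkernel */

/* B is stored transposed (nb x kb, row-major): Bt[j * ldbt + p] holds the
 * logical element B[p][j]. Pack the kb x nb logical block into NR-wide
 * panels, performing the transpose during the copy and zero-padding the
 * rightmost partial panel. */
static void pack_b_trans(size_t kb, size_t nb,
                         const float *Bt, size_t ldbt,
                         float *packed)
{
    for (size_t j = 0; j < nb; j += NR) {
        size_t w = (j + NR < nb) ? NR : nb - j;
        for (size_t p = 0; p < kb; p++) {
            for (size_t r = 0; r < w; r++)
                *packed++ = Bt[(j + r) * ldbt + p]; /* strided gather = transpose */
            for (size_t r = w; r < NR; r++)
                *packed++ = 0.0f;                   /* pad edge panel */
        }
    }
}
```

The microkernel then reads the packed buffer with unit stride, which is what makes multi-vector loads effective regardless of how the operand was originally laid out.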
Chencheng Deng
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Weiling Yang
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Jianbin Fang
Associate Professor, National University of Defense Technology
High-Performance Computing, Programming Systems, Compilers, Software-Hardware Codesign
Dezun Dong
Professor, School of Computer Science, National University of Defense Technology
computer architecture, high performance computing, interconnection networks, machine learning systems