A Flexible Instruction Set Architecture for Efficient GEMMs

📅 2025-07-04
🤖 AI Summary
Existing matrix instruction set architectures (ISAs) suffer from low efficiency on small, irregular matrices in GEMM operations and lack flexibility for diverse data formats and deep learning workloads (e.g., convolutions, Transformers). This paper proposes the Matrix Tile Extension (MTE), a decoupled matrix ISA that fully separates the instruction-set specification from the microarchitectural implementation. The design supports three-dimensional vectorization over the M, N, and K dimensions, maintains backward compatibility with existing vector ISAs, and introduces only a minimal set of new instructions plus a 64-bit control status register, enabling low-overhead, flexible extensibility. Key techniques include vector register reuse, dynamic tile-shape configuration, and co-execution with existing SIMD/vector code. Experimental evaluation demonstrates a 1.35x speedup in GEMM performance over the best state-of-the-art matrix ISA and significant improvements in computational efficiency across core kernels of mainstream AI models.

📝 Abstract
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple-Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver higher throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to adapt dynamically to application-specific aspects like the data format. This paper demonstrates that state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
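The core idea in the abstract, tiling a GEMM across all three dimensions (M, N, K) with a tile shape chosen by software rather than fixed by the hardware, can be sketched in plain Python. This is an illustrative model only: the function name and the runtime tile parameters `tm`, `tn`, `tk` are hypothetical stand-ins for the tile shape MTE would configure through its 64-bit CSR, not actual MTE instruction semantics.

```python
def mte_style_gemm(A, B, C, tm, tn, tk):
    """Tiled C += A @ B over lists of lists.

    The tile shape (tm, tn, tk) is a runtime parameter, mirroring how a
    CSR-configured matrix ISA can decouple the tile shape from the
    microarchitecture. Illustrative sketch, not real MTE behavior.
    """
    M, K, N = len(A), len(B), len(B[0])
    for i0 in range(0, M, tm):          # tiled M dimension
        for j0 in range(0, N, tn):      # tiled N dimension
            for p0 in range(0, K, tk):  # tiled K (reduction) dimension
                # Edge tiles shrink automatically; a configurable tile
                # shape is what helps small, tall, or skinny matrices
                # avoid wasted lanes.
                for i in range(i0, min(i0 + tm, M)):
                    for j in range(j0, min(j0 + tn, N)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tk, K)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

In a hardware implementation, the three inner loops would collapse into tile instructions operating on the vector register file; here they simply make the M/N/K decomposition explicit.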
Problem

Research questions and friction points this paper is trying to address.

Optimizing GEMM performance for small, tall, or skinny matrices
Overcoming the rigidity of current matrix ISAs with respect to data formats
Improving the efficiency of convolution and transformer models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Completely decouples the ISA from the microarchitecture
Requires only a few additional instructions and a 64-bit CSR
Vectorizes GEMMs across the M, N, and K dimensions
Alexandre de Limas Santana
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya
Adrià Armejach Sanosa
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya
Francesc Martinez
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya
Erich Focht
NEC
HPC, Computer Architectures, AI, High Energy Physics
Marc Casas
Leading Researcher, Barcelona Supercomputing Center
High Performance Computing, Computer Architecture