🤖 AI Summary
This work addresses the redundant packing and unpacking that the BLAS API forces on sequences of dependent GEMM calls, where every call must pack its inputs and restore its output to a canonical memory layout. The authors propose LP-GEMM, a decomposition of the GEMM kernel that propagates the packed data layout across consecutive GEMM calls, eliminating repeated intermediate data transformations while preserving BLAS semantic correctness at the boundaries. The approach is evaluated on x86 (AVX-512) and RISC-V (RVV 1.0) architectures using MLP-like and Attention-like workloads. Experimental results show an average 2.25× speedup over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. Practicality is further demonstrated with a standalone C++ implementation of the Llama-3.2 inference path built exclusively on BLAS-level GEMM calls.
📝 Abstract
In scientific computing and modern machine learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack its input matrices and restore its output to a canonical memory layout. In sequential GEMMs, where the output of one call is the input of the next, these constraints cause redundant packing and unpacking, wasting valuable computational resources.
This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show an average speedup of 2.25× over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL.
We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.
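To make the idea of packing-layout propagation concrete, the following is a minimal C++ sketch under stated assumptions, not the LP-GEMM implementation. It models a two-GEMM, MLP-like chain Y = W2 · (W1 · X) with a simplified column-panel packing of the right-hand operand. Every name in it (`NR`, `Packed`, `pack`, `unpack`, `gemm_packed`, `gemm_canonical`, `chain_baseline`, `chain_propagated`, `g_counters`) is hypothetical, the panel layout is an assumption chosen for illustration, and the kernel is a plain scalar loop rather than an AVX-512 or RVV micro-kernel. The only point it illustrates is the count of pack and unpack steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names are illustrative; none come from LP-GEMM or any BLAS library.
constexpr std::size_t NR = 4;        // assumed panel width, in columns
using Matrix = std::vector<double>;  // canonical layout: row-major

// Assumed packed layout: panel p holds columns [p*NR, p*NR + NR),
// stored row by row; the last panel is zero-padded.
struct Packed {
    std::size_t rows = 0, cols = 0;
    std::vector<double> data;
    std::size_t panels() const { return (cols + NR - 1) / NR; }
};

struct Counters { int packs = 0; int unpacks = 0; };
static Counters g_counters;  // counts layout transformations

Packed pack(const Matrix& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    ++g_counters.packs;
    Packed p;
    p.rows = rows;
    p.cols = cols;
    p.data.assign(p.panels() * rows * NR, 0.0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            p.data[(c / NR) * rows * NR + r * NR + (c % NR)] = m[r * cols + c];
    return p;
}

Matrix unpack(const Packed& p) {
    ++g_counters.unpacks;
    Matrix m(p.rows * p.cols);
    for (std::size_t r = 0; r < p.rows; ++r)
        for (std::size_t c = 0; c < p.cols; ++c)
            m[r * p.cols + c] = p.data[(c / NR) * p.rows * NR + r * NR + (c % NR)];
    return m;
}

// C = A * B with A canonical (m x k) and B packed (k x n).
// C is written directly in the packed layout, so it can feed the next call.
Packed gemm_packed(const Matrix& a, std::size_t m, const Packed& b) {
    const std::size_t k = b.rows;
    assert(a.size() == m * k);
    Packed c;
    c.rows = m;
    c.cols = b.cols;
    c.data.assign(c.panels() * m * NR, 0.0);
    for (std::size_t p = 0; p < b.panels(); ++p)
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t kk = 0; kk < k; ++kk)
                for (std::size_t j = 0; j < NR; ++j)
                    c.data[p * m * NR + i * NR + j] +=
                        a[i * k + kk] * b.data[p * k * NR + kk * NR + j];
    return c;
}

// BLAS-style call: canonical in, canonical out, so it packs and unpacks each time.
Matrix gemm_canonical(const Matrix& a, std::size_t m,
                      const Matrix& b, std::size_t k, std::size_t n) {
    return unpack(gemm_packed(a, m, pack(b, k, n)));
}

// Y = W2 * (W1 * X) as two independent calls: 2 packs, 2 unpacks.
// W1 is h x k, W2 is o x h, X is k x n.
Matrix chain_baseline(const Matrix& w1, std::size_t h, const Matrix& w2, std::size_t o,
                      const Matrix& x, std::size_t k, std::size_t n) {
    Matrix hidden = gemm_canonical(w1, h, x, k, n);
    return gemm_canonical(w2, o, hidden, h, n);
}

// Same chain with the packed layout carried between calls: 1 pack, 1 unpack.
Matrix chain_propagated(const Matrix& w1, std::size_t h, const Matrix& w2, std::size_t o,
                        const Matrix& x, std::size_t k, std::size_t n) {
    Packed hidden = gemm_packed(w1, h, pack(x, k, n));
    return unpack(gemm_packed(w2, o, hidden));
}
```

In this sketch, calling `chain_baseline` and `chain_propagated` on the same inputs yields the same canonical result, while `g_counters` records two pack/unpack pairs for the baseline and one for the propagated chain. The canonical layout is still restored at the end of the chain, which mirrors the abstract's requirement that BLAS semantics hold at the boundaries.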