🤖 AI Summary
This work proposes a tuning-free General Matrix Multiplication (GEMM) approach based on generalized Hilbert space-filling curves, addressing the performance instability of conventional GEMM across diverse hardware architectures and matrix shapes. By partitioning the computation space along a high-locality curve, the method obtains platform-oblivious and shape-oblivious schemes without per-platform tuning of tensor layouts, parallelization, or cache blocking. It further integrates communication-avoiding algorithms that replicate the input tensors and provably minimize data movement on the critical path. The implementation is compact (about 30 lines of code) and outperforms vendor libraries on multiple CPU platforms by up to 2× (geometric-mean speedup) over a range of GEMM shapes.
📝 Abstract
General Matrix Multiplication (GEMM) is the cornerstone of Deep Learning and HPC workloads; accordingly, academia and industry have heavily optimized this kernel. Modern platforms with matrix multiplication accelerators exhibit high FLOP/Byte machine balance, which makes implementing optimal matrix multiplication challenging. On modern CPU platforms with matrix engines, state-of-the-art vendor libraries tune input tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. However, the best settings for these parameters depend strongly on the target platform (number of cores, memory hierarchy, cache sizes) and on the shapes of the matrices, making exhaustive tuning infeasible; in practice this leads to performance "glass jaws". In this work we revisit space-filling curves (SFCs) to alleviate the problem of this cumbersome tuning. SFCs convert multi-dimensional coordinates (e.g. 2D) into a single dimension (1D), keeping nearby points in the high-dimensional space close in the 1D order. We partition the matrix multiplication computation space using recent advancements in generalized SFCs (Generalized Hilbert Curves), and we obtain platform-oblivious and shape-oblivious matrix multiplication schemes that exhibit an inherently high degree of data locality. Furthermore, we extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that replicate the input tensors and provably minimize communication/data movement on the critical path. The integration of CA algorithms is seamless and yields compact code (~30 LOC), yet it achieves state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 2x (geometric-mean speedup) for a range of GEMM shapes.
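To make the SFC-based partitioning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the classic Hilbert curve on a square grid of output tiles whose side is a power of two, whereas the abstract relies on Generalized Hilbert Curves that handle arbitrary rectangular shapes. The names `hilbert_d2xy` and `partition_tiles` are hypothetical and chosen only for illustration. Each worker receives a contiguous segment of the 1D curve order, so the output tiles it computes are close together in 2D and reuse the same blocks of the input matrices.

```python
def hilbert_d2xy(n, d):
    """Map 1D index d to (x, y) on an n x n grid using the classic Hilbert curve.

    Assumption: n is a power of two. The generalized curves mentioned in the
    abstract lift this restriction; this sketch does not.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square so consecutive segments connect.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def partition_tiles(n, num_workers):
    """Split the n x n grid of output tiles into contiguous curve segments.

    Returns one list of (x, y) tile coordinates per worker (hypothetical helper).
    """
    order = [hilbert_d2xy(n, d) for d in range(n * n)]
    total = len(order)
    chunks = []
    for w in range(num_workers):
        start = w * total // num_workers
        end = (w + 1) * total // num_workers
        chunks.append(order[start:end])
    return chunks
```

For example, `partition_tiles(4, 4)` assigns each of the four workers a 2 x 2 quadrant of output tiles, so each worker touches only two block-rows of one input and two block-columns of the other. A row-major split of the same grid would instead give each worker one block-row and all four block-columns.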