🤖 AI Summary
GPU GEMM kernels traditionally rely on time-consuming runtime autotuning to determine optimal launch parameters. Method: This paper proposes an analytical modeling approach that explicitly and jointly models the GPU memory hierarchy, code generation logic, and data layout constraints—enabling prediction of near-optimal kernel configurations solely from matrix dimensions, hardware architectural features, and tiling strategies. Implemented as a lightweight Triton-based GEMM framework, it eliminates runtime tuning entirely. Results: Across multiple GPU generations and GEMM problem sizes, the method achieves ≥95% of the performance attained by state-of-the-art autotuners, while reducing tuning overhead to zero. Its core contribution is the first interpretable, tuning-free analytical model for GEMM parameter prediction—significantly improving deployment efficiency and cross-architecture generalizability.
📝 Abstract
We present tritonBLAS, a fast and deterministic analytical model that uses architectural parameters, such as the cache hierarchy and the relative placement of code and data, to generate performant GPU GEMM kernels. tritonBLAS explicitly models the relationship between architectural topology, matrix shapes, and algorithmic blocking behavior to predict near-optimal configurations without runtime autotuning. Based on this model, we implemented a lightweight GEMM framework entirely within Triton. We evaluate tritonBLAS across a diverse set of GEMM problem sizes on modern GPUs, where it achieves over 95% of the performance of autotuned solutions while reducing autotuning time to zero. This makes tritonBLAS a practical drop-in replacement for empirical tuning in production HPC and ML workloads.
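To make the idea concrete, the kind of prediction the abstract describes (choosing tile sizes from matrix shape and a few architectural parameters, with no empirical search) can be sketched as below. This is a minimal illustrative heuristic, not the paper's actual model; the function name, the default values for shared-memory size and compute-unit count, and the shrink rules are all assumptions.

```python
# Illustrative sketch (NOT tritonBLAS's real model): analytically pick GEMM
# tile sizes (BLOCK_M, BLOCK_N, BLOCK_K) from the matrix shape and coarse
# hardware parameters, instead of runtime autotuning.

def predict_tiles(M, N, K, lds_bytes=65536, elem_size=2, num_cus=104):
    """Start from large power-of-two tiles, shrink BLOCK_K until the A and B
    operand tiles fit in shared memory, then shrink the output tile until the
    grid exposes roughly one work-group per compute unit."""
    bm, bn, bk = 128, 128, 64

    def lds_use(bm, bn, bk):
        # Bytes needed to stage one A tile (bm x bk) and one B tile (bk x bn).
        return (bm * bk + bk * bn) * elem_size

    def grid(bm, bn):
        # Number of output tiles: ceil(M / bm) * ceil(N / bn).
        return -(-M // bm) * -(-N // bn)

    # Shared-memory capacity constraint.
    while bk > 16 and lds_use(bm, bn, bk) > lds_bytes:
        bk //= 2

    # Occupancy constraint: small problems get smaller output tiles.
    while bm > 16 and grid(bm, bn) < num_cus:
        bm //= 2
        if bn > 16 and grid(bm, bn) < num_cus:
            bn //= 2

    return bm, bn, bk
```

For a large square problem such as M = N = K = 4096 with the default parameters, the capacity and occupancy constraints are already satisfied, so the sketch keeps the full 128x128x64 tiling; a real model would additionally account for cache topology and code/data placement, as the abstract states.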