Fast GPU Linear Algebra via Compile Time Expression Fusion

📅 2026-04-24

🤖 AI Summary

This work addresses inefficiencies in conventional GPU linear algebra libraries, which often fail to fully exploit the hardware because of redundant memory accesses and runtime overhead. The authors propose Bandicoot, a C++ toolkit that uses template metaprogramming to fuse expressions at compile time, generating GPU kernels that are often able to saturate memory bandwidth without relying on just-in-time compilation or other runtime infrastructure. Bandicoot's API is compatible with the Armadillo CPU linear algebra library, which eases migration of existing CPU codebases. The reported experiments show Bandicoot outperforming PyTorch, TensorFlow, and JAX, sometimes by considerable margins.

📝 Abstract

We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
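The general idea of compile-time expression fusion can be illustrated with a small CPU-only C++ sketch using expression templates. This is not Bandicoot's implementation or API: the namespace `sketch` and the names `Vec`, `AddExpr`, and `ScaleExpr` are hypothetical, and a plain loop stands in for the GPU kernel that a real toolkit would generate. The sketch only shows how the type of an expression such as `(a + b) * 2.0f` can encode the whole computation, so that a single pass over memory evaluates it with no intermediate temporaries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace; not part of Bandicoot

// Node representing lhs + rhs. It stores references only, so the expression
// must be consumed in the same statement that builds it.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Node representing expr * factor.
template <typename E>
struct ScaleExpr {
    const E& expr;
    float factor;
    float operator[](std::size_t i) const { return expr[i] * factor; }
    std::size_t size() const { return expr.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression runs one fused loop: each element is
    // read, combined, and written once. This loop plays the role of the
    // generated kernel in this CPU-only sketch.
    template <typename E>
    Vec& operator=(const E& e) {
        assert(e.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = e[i];
        }
        return *this;
    }
};

// Operators build expression nodes instead of computing results. They are
// found through argument-dependent lookup for types in this namespace.
template <typename L, typename R>
AddExpr<L, R> operator+(const L& lhs, const R& rhs) {
    return AddExpr<L, R>{lhs, rhs};
}

template <typename E>
ScaleExpr<E> operator*(const E& expr, float factor) {
    return ScaleExpr<E>{expr, factor};
}

}  // namespace sketch
```

With these definitions, `out = (a + b) * 2.0f;` has the right-hand type `ScaleExpr<AddExpr<Vec, Vec>>`, which the compiler can inline into a single loop. An unfused version would first materialise `a + b` as a temporary vector and then scale it in a second pass, roughly doubling the memory traffic for this expression.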

Problem

Research questions and friction points this paper is trying to address.

GPU linear algebra

compile-time fusion

memory bandwidth

ease of use

performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

compile-time fusion

template metaprogramming

GPU kernel optimization