🤖 AI Summary
This work addresses the inefficiencies of conventional GPU linear algebra libraries, which often fail to fully exploit hardware capabilities due to redundant memory accesses and runtime overhead. To overcome these limitations, the authors propose Bandicoot, a toolkit that leverages C++ template metaprogramming to perform expression fusion at compile time, thereby automatically generating highly optimized GPU kernels that saturate memory bandwidth—without relying on just-in-time compilation or runtime scheduling. Bandicoot provides an API compatible with Armadillo, facilitating straightforward migration of existing CPU codebases. Experimental results demonstrate that Bandicoot significantly outperforms PyTorch, TensorFlow, and JAX across multiple benchmarks, achieving substantial speedups in several scenarios.
📝 Abstract
We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.