🤖 AI Summary
Tensor accelerators (e.g., NVIDIA Tensor Cores) are increasingly prevalent in CPUs and GPUs, yet their programmability remains limited: existing kernel libraries target only traditional ML and scientific computing workloads, failing to support non-ML linear matrix transform workloads such as image processing. This paper proposes a flexible tensor instruction selection mechanism based on equality saturation, enabling general-purpose, schedulable compilation for tensor hardware. Integrated with the Halide domain-specific language and compiler, the approach retains full compatibility with existing scheduling primitives while substantially broadening the programmability of tensor accelerators. Evaluated on an NVIDIA RTX 4070, the framework achieves a 6.1× speedup on image processing pipelines such as downsampling, demonstrating, for the first time, systematic performance acceleration of tensor hardware in non-ML domains.
📝 Abstract
Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, so developers rely on vendor-provided kernel libraries. As a result, the use of tensor accelerators is limited to the provided interfaces, which are mainly designed for traditional ML and scientific computing workloads. In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore exploit such specialized hardware. This is nonetheless hindered by the difficulty of programming tensor accelerators. We tackle this problem with compiler-based techniques. We use the Halide user-schedulable language, in which these operations can be expressed succinctly as Halide algorithms. To this end, we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines. Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by $6.1\times$ by utilizing Tensor Cores on an NVIDIA RTX 4070 GPU.
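To see why such pipelines are "linear transformations over matrices in disguise", consider 2×2 box-filter downsampling: averaging each 2×2 block of an image `X` is exactly the product `D @ X @ D.T` for a sparse averaging matrix `D`, i.e., two matrix multiplies of the kind tensor accelerators execute natively. A minimal NumPy sketch (illustrative only; the paper's system emits Halide code targeting Tensor Cores, not NumPy):

```python
import numpy as np

def downsample_matrix(n):
    # D maps an axis of length n to length n//2 by averaging adjacent pairs.
    D = np.zeros((n // 2, n))
    for i in range(n // 2):
        D[i, 2 * i] = D[i, 2 * i + 1] = 0.5
    return D

rng = np.random.default_rng(0)
X = rng.random((8, 8))          # toy "image"
D = downsample_matrix(8)

# 2x2 box-filter downsampling expressed as two matrix multiplies --
# exactly the MatMul-shaped work that tensor accelerators perform.
Y = D @ X @ D.T

# Reference: direct 2x2 block averaging.
ref = X.reshape(4, 2, 4, 2).mean(axis=(1, 3))
assert np.allclose(Y, ref)
```

The same factorization covers other separable linear operators (blurs, resampling with other kernels), which is why a matmul-shaped instruction selector can accelerate them.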
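The "flexible tensor instruction selector based on equality saturation" can be caricatured in a few lines: instead of committing to one rewrite greedily, explore and retain every equivalent form of an expression, then extract the cheapest form that maps onto a hardware instruction. The toy below is not the paper's implementation (a real system would use a proper e-graph and operate on Halide IR); the rule set, cost numbers, and the `mma` (fused multiply-accumulate) instruction name are invented for illustration:

```python
# Toy equality-saturation-style instruction selection.
# Expressions are nested tuples: ("add", "C", ("mul", "A", "B")).

def rewrites(e):
    """All expressions reachable from e by one rewrite, anywhere in the tree."""
    out = set()
    if not isinstance(e, tuple):
        return out
    # Pattern: add(mul(a, b), c) -> mma(a, b, c), the accelerator instruction.
    if e[0] == "add" and isinstance(e[1], tuple) and e[1][0] == "mul":
        out.add(("mma", e[1][1], e[1][2], e[2]))
    # Commutativity of add (keeps both forms, per equality saturation).
    if e[0] == "add":
        out.add(("add", e[2], e[1]))
    # Recurse into children.
    for i, child in enumerate(e[1:], start=1):
        for rc in rewrites(child):
            out.add(e[:i] + (rc,) + e[i + 1:])
    return out

def saturate(e, steps=5):
    """Grow the set of equivalent forms until no rule adds anything new."""
    forms = {e}
    for _ in range(steps):
        new = set().union(*(rewrites(f) for f in forms))
        if new <= forms:
            break
        forms |= new
    return forms

def cost(e):
    if not isinstance(e, tuple):
        return 0
    return {"mma": 1, "mul": 4, "add": 2}[e[0]] + sum(cost(c) for c in e[1:])

expr = ("add", "C", ("mul", "A", "B"))   # C + A*B: a matmul-accumulate in disguise
best = min(saturate(expr), key=cost)
print(best)  # -> ('mma', 'A', 'B', 'C')
```

Note that reaching `mma` here requires first applying commutativity; a greedy selector that matched patterns in one pass over the original term would miss it, which is the motivation for keeping all equivalent forms.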