Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language

📅 2025-12-02
🤖 AI Summary
Tensor accelerators (e.g., NVIDIA Tensor Cores) are increasingly prevalent in CPUs and GPUs, yet their programmability remains limited: existing kernel libraries target only traditional ML and scientific computing workloads, and do not support non-ML linear-transform workloads such as image processing. This paper proposes a flexible, equality-saturation-based tensor instruction selection mechanism, enabling general-purpose, schedulable compilation for tensor hardware. Integrated with the Halide domain-specific language and compiler, the approach retains full compatibility with existing scheduling primitives while substantially broadening the programmability of tensor accelerators. Evaluated on an NVIDIA RTX 4070, the framework achieves up to a 6.1× speedup on image processing pipelines, including downsampling, demonstrating systematic acceleration of non-ML workloads on tensor hardware.

📝 Abstract
Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, leading developers to rely on vendor-provided kernel libraries that support tensor accelerators. As a result, the usage of tensor accelerators is limited to the provided interface, which is mainly designed for traditional ML and scientific computing workloads. In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore utilize such specialized hardware. This is nonetheless hindered by the difficulties in programming tensor accelerators. We tackle this problem with compiler-based techniques. We use the Halide user-schedulable language and express operations succinctly as Halide algorithms. To this end, we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines. Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by 6.1× by utilizing Tensor Cores on an NVIDIA RTX 4070 GPU.
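The abstract's key observation is that many image operations are "linear transformations over matrices in disguise." As a hedged, self-contained illustration (mine, not taken from the paper), the sketch below expresses a 2× box downsample of a 1-D signal two ways: as the direct averaging loop a programmer would write, and as an equivalent matrix-vector product, which is the form a MatMul accelerator can execute.

```python
# Toy illustration: a 2x box downsample is a linear map, so it can be
# written as a matrix-vector product (the form tensor hardware accepts).

def downsample_direct(x):
    """Average each adjacent pair of samples (the 'obvious' loop form)."""
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]

def downsample_matrix(n):
    """Build the (n//2) x n matrix D such that D @ x is the 2x downsample."""
    m = n // 2
    D = [[0.0] * n for _ in range(m)]
    for i in range(m):
        D[i][2 * i] = 0.5
        D[i][2 * i + 1] = 0.5
    return D

def matvec(A, x):
    """Plain matrix-vector product (stand-in for an accelerator MatMul)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

signal = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
direct = downsample_direct(signal)
via_matmul = matvec(downsample_matrix(len(signal)), signal)
assert direct == via_matmul  # both compute [2.0, 6.0, 3.0]
```

The same reasoning extends to 2-D images and to the filtering and resampling pipelines the paper evaluates; the compiler's job is to discover this matrix form automatically rather than require the programmer to write it.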
Problem

Research questions and friction points this paper is trying to address.

Enabling tensor accelerators for diverse applications beyond traditional MatMul operations
Overcoming programming difficulties of tensor accelerators using compiler-based techniques
Demonstrating performance improvements for image processing pipelines via tensor hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler-based techniques using Halide language
Flexible tensor instruction selector with equality saturation
Supports CPU and GPU tensor accelerators with scheduling
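To make the "equality saturation" bullet concrete: the technique repeatedly applies semantics-preserving rewrite rules to collect a set of equivalent program forms, then an extractor picks a form that matches a hardware instruction. The toy sketch below (my own simplification, not the paper's implementation; `dot` is a hypothetical accelerator intrinsic, and a real system would use an e-graph rather than an explicit set of terms) shows the idea on a two-term multiply-add.

```python
# Hedged toy sketch of equality saturation: grow the set of forms
# equivalent to an input expression under rewrite rules, then extract
# a form whose head is the (hypothetical) accelerator op "dot".
# Expressions are nested tuples: ("+", a, b), ("*", a, b), ("dot", u, v).

def rewrites(e):
    """Yield expressions equal to e under two toy rules."""
    if isinstance(e, tuple):
        op, *args = e
        # Rule 1: a*x + b*y  =>  dot((a, b), (x, y))
        if op == "+" and all(isinstance(a, tuple) and a[0] == "*" for a in args):
            (_, a, x), (_, b, y) = args
            yield ("dot", (a, b), (x, y))
        # Rule 2: commutativity of +
        if op == "+":
            yield ("+", args[1], args[0])

def saturate(e):
    """Collect all forms reachable from e (a tiny stand-in for an e-graph)."""
    seen = {e}
    frontier = [e]
    while frontier:
        for nxt in rewrites(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

expr = ("+", ("*", "a", "x"), ("*", "b", "y"))
forms = saturate(expr)
# Extraction: prefer any form headed by the accelerator instruction.
chosen = next(f for f in forms if f[0] == "dot")
```

The saturation step is what makes the selector "flexible": because all equivalent forms coexist, the extractor can find an accelerator-shaped form even when the programmer's original expression does not syntactically look like a tensor instruction.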
Yihong Zhang
University of Washington, USA
Derek Gerstmann
Adobe, USA
Andrew Adams
MIT
Image Processing · Computational Photography
Maaz Bin Safeer Ahmad
Adobe, USA