🤖 AI Summary
Existing tensor compilers struggle to fuse complex reductions with loop-carried dependencies—such as those in attention mechanisms—limiting data locality and memory efficiency. This work introduces Neptune, the first GPU tensor compiler supporting Algebraic Correction Expressions (ACE): it formally restructures loop-carried dependencies to generate semantically equivalent yet fusible expressions, enabling aggressive operator fusion while preserving correctness. Neptune automatically generates optimized kernels from high-level scheduling templates, eliminating the need for manual tuning. Evaluated on ten diverse attention benchmarks across four NVIDIA and AMD GPU architectures, Neptune achieves an average 1.35× speedup over state-of-the-art compilers—including Triton and TVM—while significantly improving computational density and memory bandwidth utilization.
📝 Abstract
Operator fusion, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers, has become a key optimization for deep learning. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms.
The paper introduces Neptune, a tensor compiler that performs advanced operator fusion across sequences of reduction operators. Neptune's approach intentionally breaks some loop-carried dependencies and compensates by constructing algebraic correction expressions that allow the fused kernel to still produce the correct result.
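The abstract does not spell out Neptune's algebraic correction expressions, but the best-known instance of this pattern is the online-softmax rescaling used in FlashAttention-style attention kernels: the running maximum is a loop-carried dependency, and when a new block raises it, previously accumulated terms are fixed up by multiplying with a correction factor rather than being recomputed. A minimal illustrative sketch (not Neptune's actual compiler machinery):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum over (score, value) pairs.

    The running max `m` is a loop-carried dependency; when it changes,
    the stale accumulators are rescaled by an algebraic correction
    factor exp(m_old - m_new) instead of reprocessing earlier inputs.
    """
    m = float("-inf")   # running maximum (loop-carried dependency)
    denom = 0.0         # running sum of exp(score - m)
    acc = 0.0           # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)       # correction for stale accumulators
        p = math.exp(s - m_new)          # current term, in the new scale
        denom = denom * corr + p
        acc = acc * corr + p * v
        m = m_new
    return acc / denom
```

Because each step only rescales the accumulators, the loop can be tiled and fused with the producer of `scores`, which is exactly the kind of fusion the correction expressions enable; the result matches a naive two-pass softmax.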
On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels achieve an average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.