🤖 AI Summary
Existing tensor compilers struggle to fuse complex reductions with loop-carried dependencies—such as those in attention mechanisms—limiting data locality and memory efficiency. This work introduces Neptune, the first GPU tensor compiler supporting Algebraic Correction Expressions (ACE): it formally restructures loop-carried dependencies to generate semantically equivalent yet fusible expressions, enabling aggressive operator fusion while preserving correctness. Neptune automatically generates optimized kernels from high-level scheduling templates, eliminating the need for manual tuning. Evaluated on ten diverse attention benchmarks across four NVIDIA and AMD GPU architectures, Neptune achieves an average 1.35× speedup over state-of-the-art compilers—including Triton and TVM—while significantly improving computational density and memory bandwidth utilization.
📝 Abstract
Operator fusion, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers, has become a key optimization for deep learning. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms.
The paper introduces Neptune, a tensor compiler that performs advanced operator fusion across sequences of reduction operators. Neptune's approach intentionally breaks some loop-carried dependencies and compensates by constructing algebraic correction expressions that allow the fused kernel to still produce the correct result.
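The abstract does not spell out Neptune's algebraic correction expressions, but the best-known instance of this pattern is the online-softmax rescaling used in FlashAttention-style attention kernels: the running maximum is a loop-carried dependency, and when a new block raises it, previously accumulated terms are fixed up by multiplying with a correction factor rather than being recomputed. A minimal illustrative sketch (not Neptune's actual compiler machinery):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum over (score, value) pairs.

    The running max `m` is a loop-carried dependency; when it changes,
    the stale accumulators are rescaled by an algebraic correction
    factor exp(m_old - m_new) instead of reprocessing earlier inputs.
    """
    m = float("-inf")   # running maximum (loop-carried dependency)
    denom = 0.0         # running sum of exp(score - m)
    acc = 0.0           # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)       # correction for stale accumulators
        p = math.exp(s - m_new)          # current term, in the new scale
        denom = denom * corr + p
        acc = acc * corr + p * v
        m = m_new
    return acc / denom
```

Because each step only rescales the accumulators, the loop can be tiled and fused with the producer of `scores`, which is exactly the kind of fusion the correction expressions enable; the result matches a naive two-pass softmax.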
On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels achieve an average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.