Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Distributed AI systems suffer from poor latency hiding due to the difficulty of jointly optimizing computation, communication, and memory access. To address this, we propose the first Triton compiler extension supporting native overlap optimization. Our approach—operating transparently beneath the Python frontend—leverages compiler-level distributed scheduling to automatically fuse and co-optimize compute, OpenSHMEM-based communication, and memory operations, enabling fine-grained overlap across single- and multi-node settings. Its key innovation lies in deeply integrating heterogeneous resource-aware overlap optimization into a high-level DSL compilation stack, eliminating the need for manual CUDA/C++ implementation. Evaluated on a 64-GPU cluster, our optimized distributed AI kernels outperform hand-tuned baselines across multiple workloads, while significantly improving developer productivity and lowering the barrier to distributed systems programming.

📝 Abstract
In this report, we propose Triton-distributed, an extension of the existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing broad coverage of existing optimizations from different frameworks. First, we integrate communication primitives compliant with the OpenSHMEM standard into the compiler. This enables programmers to use these primitives from a higher-level Python programming model. Second, we illustrate how to achieve complex joint optimization of computation, memory access, and communication with the assistance of the compiler. In particular, we show how to use overlapping techniques to hide latency and present our compiler-based programming methods in both single-node and multi-node scenarios. Finally, we showcase the performance of the code generated by our compiler. In a test environment with up to 64 devices, our compiler can fully utilize heterogeneous communication and computation resources to provide effective overlapping and high performance. In many cases, the generated code even outperforms hand-optimized code. Moreover, the development difficulty and time cost of using our compiler are far lower than those of low-level programming in CUDA/C++, which clearly demonstrates significant productivity advantages.
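The overlap idea the abstract describes can be illustrated in plain Python. The sketch below is not the Triton-distributed API (the function and variable names here are hypothetical); it only models the scheduling pattern: communication delivers data tile by tile, each arrival sets a per-tile signal, and the compute loop consumes each tile as soon as its signal fires, so compute on early tiles overlaps communication of later ones instead of waiting for the whole transfer.

```python
import threading
import time

NUM_TILES = 8

def overlapped_sum(tiles):
    """Sum `tiles`, overlapping per-tile 'communication' with compute.

    A minimal single-process model of signal-based fine-grained overlap;
    real kernels would replace the producer thread with asynchronous
    OpenSHMEM-style puts and the events with device-side signal flags.
    """
    signals = [threading.Event() for _ in range(NUM_TILES)]
    buffer = [None] * NUM_TILES

    def communicate():
        # Producer: simulates an async transfer of each tile,
        # then raises that tile's completion signal.
        for i, tile in enumerate(tiles):
            time.sleep(0.001)      # stand-in for network latency
            buffer[i] = tile
            signals[i].set()       # per-tile completion signal

    comm = threading.Thread(target=communicate)
    comm.start()

    total = 0
    for i in range(NUM_TILES):
        signals[i].wait()          # block only on this tile, not on all
        total += buffer[i]         # compute overlaps remaining transfers
    comm.join()
    return total
```

The design point this models is the one the paper targets: a single coarse barrier between communication and compute serializes the two phases, while per-tile signals let the compiler interleave them automatically.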
Problem

Research questions and friction points this paper is trying to address.

Enables overlapping optimizations for distributed AI workloads
Integrates OpenSHMEM communication primitives with Python programming
Reduces development difficulty compared to low-level programming like CUDA/C++
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Triton compiler for distributed AI systems
Integrates OpenSHMEM communication primitives in Python
Optimizes computation, memory, and communication jointly
Authors
Size Zheng — ByteDance Seed
Wenlei Bao — ByteDance Seed
Qi Hou — Tsinghua University
Xuegui Zheng — ByteDance Seed
Jin Fang — ByteDance Seed
Chenhui Huang — ByteDance Seed
Tianqi Li — ByteDance Seed, Peking University
Haojie Duanmu — ByteDance Seed, Shanghai Jiao Tong University
Renze Chen — Peking University
Ruifan Xu — ByteDance Seed, Peking University
Yifan Guo — ByteDance Seed, Zhejiang University
Ningxin Zheng — ByteDance AML
Ziheng Jiang — ByteDance
Xinyi Di — ByteDance Seed
Dongyang Wang — ByteDance Seed
Jianxi Ye — ByteDance Seed
Haibin Lin — ByteDance
Li-Wen Chang — ByteDance
Liqiang Lu — Zhejiang University
Yun Liang — Peking University
Jidong Zhai — Tsinghua University
Xin Liu — ByteDance Seed