DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

📅 2026-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two obstacles to scaling large language models: distributed programming is too rigid for rapidly evolving model architectures, and existing tensor compilers do not handle the memory hierarchy of distributed clusters effectively. The authors propose DITRON, a scalable tile-level compiler built on a three-level programming abstraction (Core, Device, and Task) that supports diverse parallelism strategies, hides the complexity of intra-node and inter-node communication, and targets both NVIDIA and AMD platforms. DITRON matches or exceeds expert-tuned CUDA libraries, with speedups of 6% to 30% on isolated kernels and 5% to 30% on end-to-end inference in vLLM. In enterprise deployment, it improves training model FLOPs utilization (MFU) by over 10%, saving approximately 500,000 GPU hours of training cost per month, and delivers an end-to-end inference gain of over 20%.
📝 Abstract
The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication. Evaluated across large-scale clusters, DITRON matches or exceeds the performance of expert-tuned CUDA libraries, delivering speedups of 6% to 30% on isolated kernels and 5% to 30% on end-to-end inference in vLLM. Furthermore, DITRON demonstrates strong portability, achieving significant speedups on both NVIDIA and AMD platforms. DITRON has been deployed at the enterprise level for both training and inference. It achieves an MFU improvement of over 10% in training tasks, saving approximately 500,000 GPU hours of training cost per month. For inference tasks, it delivers an end-to-end gain of over 20% and has been applied to cloud service inference and edge inference scenarios.
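To make the Core, Device, and Task levels concrete, the following is a minimal illustrative sketch of three-level tiling for a matrix multiplication, written in plain Python. It is not DITRON's API or its generated code: the function names (`split`, `core_tile_matmul`, `device_matmul`, `task_matmul`), the row-sharding strategy, and the tile size are all assumptions chosen for illustration, devices are simulated sequentially, and communication between devices is omitted entirely.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def core_tile_matmul(A, B, C, rows, cols, K):
    """Core level (hypothetical): compute one output tile C[rows, cols]."""
    for i in range(*rows):
        for j in range(*cols):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))


def device_matmul(A, B, C, row_range, N, K, tile):
    """Device level (hypothetical): cut this device's row slab into tiles."""
    r0, r1 = row_range
    for i in range(r0, r1, tile):
        for j in range(0, N, tile):
            rows = (i, min(i + tile, r1))
            cols = (j, min(j + tile, N))
            core_tile_matmul(A, B, C, rows, cols, K)


def task_matmul(A, B, num_devices=2, tile=2):
    """Task level (hypothetical): shard the rows of A across devices.

    Devices are simulated one after another in a single process; a real
    system would run them in parallel and exchange data between them.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for row_range in split(M, num_devices):
        device_matmul(A, B, C, row_range, N, K, tile)
    return C


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]
print(task_matmul(A, B))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```

Each level only decides how to partition the work it receives and hands smaller pieces to the level below, which is the general idea behind hierarchical tiling; how DITRON actually expresses and schedules these levels is not described in the abstract.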
Problem

Research questions and friction points this paper is trying to address.

distributed programming
tensor compilers
memory hierarchy
large language models
parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed tensor compilation
hierarchical tiling
multi-level parallelism
hardware portability
high-performance kernel generation
Size Zheng
ByteDance Seed
Architecture, Compiler, Deep Learning
Xuegui Zheng
ByteDance Seed
Hanshi Sun
ByteDance Seed
Qi Hou
Tsinghua University, undergraduate
AIDM
Wenlei Bao
ByteDance Seed
Shiyu Li
ByteDance Seed
Haojie Duanmu
ByteDance Seed
Jin Fang
ByteDance Seed
Chenli Xue
ByteDance Seed
Chenhui Huang
ByteDance Seed
Yuanqiang Liu
ByteDance Seed
Renze Chen
Peking University
Ningxin Zheng
ByteDance AML
Dongyang Wang
ByteDance Seed
Li-Wen Chang
Research Scientist, ByteDance
High Performance Computing, Compiler, Computer Architecture, Algorithms, Deep Learning
Liqiang Lu
Zhejiang University
Deep Learning, Accelerator, Quantum Computing
Yun Liang
Peking University
Jidong Zhai
Tsinghua University
Parallel Computing, Compiler, Programming Model, GPU
Xin Liu
ByteDance MLSys
MLSys