TileLang: A Composable Tiled Programming Model for AI Systems

📅 2025-04-24
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
AI kernel development faces challenges including complex hardware adaptation and insufficient expressiveness and usability of domain-specific compilers. This paper proposes a composable tiling programming model that—uniquely—decouples dataflow from the scheduling space (thread mapping, memory layout, tensorization, and pipelining). By introducing a unified block-thread abstraction, lightweight scheduling primitive annotations, and dataflow-driven, hardware-aware compilation, our approach jointly optimizes developer productivity and kernel performance. The method bridges the gap between expressive power and engineering practicality in domain-specific compilers. Evaluated on mainstream accelerators—including GPUs and ASICs—our generated AI compute kernels achieve state-of-the-art performance across key workloads, while reducing development cycles significantly. The framework delivers both flexibility in algorithmic expression and high execution efficiency.

📝 Abstract
Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorization, and pipelining) from dataflow, and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's dataflow itself, while leaving most other optimizations to the compiler. We conduct comprehensive experiments on commonly used devices; across numerous kernels, our evaluation shows that TileLang achieves state-of-the-art performance, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.
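The data-flow pattern the abstract describes (moving tiles between DRAM and SRAM, then computing on the tiles) can be illustrated with a plain NumPy tiled matrix multiply. This is an illustrative sketch of the tiling pattern only, not TileLang code; the explicit tile copies stand in for global-to-shared-memory transfers.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Tiled matmul: each output block copies input tiles
    (DRAM -> 'SRAM'), computes on them, and writes the result back."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                a_tile = A[i:i+tile, k:k+tile].copy()  # stand-in for DRAM -> SRAM copy
                b_tile = B[k:k+tile, j:j+tile].copy()
                acc += a_tile @ b_tile                 # on-chip compute on the tiles
            C[i:i+tile, j:j+tile] = acc                # write tile back to DRAM
    return C
```

In a real kernel, the loop structure above is the "dataflow" that TileLang asks the programmer to express, while tile sizes, thread mapping, and pipelining of the copies are left to scheduling annotations and the compiler.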
Problem

Research questions and friction points this paper is trying to address.

Simplify writing high-performance AI kernels
Decouple scheduling from dataflow for flexibility
Achieve state-of-the-art performance with ease
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples scheduling space from dataflow
Uses customization annotations and primitives
Unified block-and-thread paradigm
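The decoupling idea, the same dataflow function carrying separate, optional scheduling hints, can be sketched with a hypothetical Python decorator. The `schedule` decorator and its parameter names here are illustrative assumptions, not TileLang's actual API; they only show how scheduling choices can live outside the dataflow body.

```python
def schedule(**hints):
    """Attach scheduling hints (layout, pipeline depth, thread binding)
    to a kernel without touching its dataflow body. Hypothetical sketch."""
    def wrap(fn):
        fn.schedule_hints = dict(hints)  # a compiler would consume these
        return fn
    return wrap

@schedule(layout="swizzled", pipeline_stages=3, threads=128)
def gemm_dataflow(A, B):
    # Pure dataflow: what to compute, not how to map it to hardware.
    return A @ B
```

Retargeting the kernel to a different accelerator would then mean changing only the annotation, while `gemm_dataflow` itself stays unchanged.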
Lei Wang
Peking University, Beijing, China
Yu Cheng
Peking University, Beijing, China
Yining Shi
Peking University, Beijing, China
Zhengju Tang
Peking University
Zhiwen Mo
Imperial College London
GPU Architecture · Performance Modeling · Dataflow Schedule
Wenhao Xie
Peking University, London, United States
Lingxiao Ma
Senior Researcher, Microsoft Research
Systems for Machine Learning · GPU
Yuqing Xia
Microsoft Research
Systems for Machine Learning · GPU
Jilong Xue
Microsoft Research
distributed systems · machine learning · deep learning · graph processing
Fan Yang
Microsoft Research, Beijing, China
Zhi Yang
Peking University, Beijing, China