🤖 AI Summary
AI kernel development is hampered by complex hardware adaptation and by domain-specific compilers that lack expressiveness and usability. This paper proposes TileLang, a composable tiled programming model that decouples dataflow from the scheduling space (thread mapping, memory layout, tensorization, and pipelining). It combines a unified block-and-thread abstraction, lightweight scheduling annotations and primitives, and dataflow-driven, hardware-aware compilation, so that developers describe the kernel's dataflow and leave most other optimizations to the compiler. The approach targets the gap between expressive power and engineering practicality in domain-specific compilers. Evaluated on mainstream accelerators, including GPUs and ASICs, the generated AI kernels achieve state-of-the-art performance on key workloads while shortening development cycles. The framework offers both flexibility in algorithmic expression and high execution efficiency.
📝 Abstract
Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined dataflow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorization, and pipelining) from dataflow and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's dataflow itself, while leaving most other optimizations to the compiler. Across comprehensive experiments on commonly used devices, our evaluation shows that TileLang achieves state-of-the-art performance on key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and the flexibility demanded by modern AI system development.
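The tile-level dataflow pattern described in the abstract (load tiles, compute on them, write the result back) can be illustrated with a minimal sketch in plain Python. This is not TileLang code and does not use its API; the names `load_tile`, `tiled_matmul`, `block_m`, `block_n`, and `block_k` are hypothetical and chosen only for illustration. The tile sizes stand in for scheduling choices, and the loop body stands in for the dataflow, under the assumption that the two can be varied independently.

```python
def load_tile(mat, r0, c0, rows, cols):
    """Copy a tile starting at (r0, c0); slicing clips it at the matrix edge.

    Stands in for a DRAM -> SRAM tile copy (hypothetical helper, not a TileLang call).
    """
    return [row[c0:c0 + cols] for row in mat[r0:r0 + rows]]


def tiled_matmul(a, b, block_m=4, block_n=4, block_k=2):
    """Compute C = A @ B one output tile at a time.

    block_m, block_n, and block_k play the role of scheduling parameters:
    changing them alters how the work is split, not the result.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(0, m, block_m):
        for j in range(0, n, block_n):
            # Accumulator tile, standing in for on-chip storage.
            acc = [[0.0] * min(block_n, n - j) for _ in range(min(block_m, m - i))]
            for p in range(0, k, block_k):
                a_tile = load_tile(a, i, p, block_m, block_k)
                b_tile = load_tile(b, p, j, block_k, block_n)
                # Tile-level multiply-accumulate.
                for r, a_row in enumerate(a_tile):
                    for q, a_val in enumerate(a_row):
                        for s, b_val in enumerate(b_tile[q]):
                            acc[r][s] += a_val * b_val
            # Write the finished tile back to the output matrix.
            for r, acc_row in enumerate(acc):
                c[i + r][j:j + len(acc_row)] = acc_row
    return c
```

Calling `tiled_matmul(a, b, 1, 1, 1)` and `tiled_matmul(a, b, 2, 2, 3)` on the same inputs gives the same matrix, which is the sense in which the schedule is separate from the dataflow in this sketch. A real compiler would additionally map tiles to threads, choose memory layouts, and pipeline the copies, none of which is modeled here.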