TileLang: A Composable Tiled Programming Model for AI Systems

📅 2025-04-24

🤖 AI Summary

AI kernel development is hampered by complex hardware-specific tuning and by domain-specific compilers that fall short on expressiveness and usability. This paper proposes a composable tiled programming model that decouples dataflow from the scheduling space (thread mapping, memory layout, tensorization, and pipelining). It combines a unified block-and-thread abstraction, lightweight scheduling annotations and primitives, and dataflow-driven, hardware-aware compilation, so that developers describe the kernel's dataflow and leave most other optimizations to the compiler. Evaluated on mainstream accelerators, including GPUs and ASICs, the generated AI kernels achieve state-of-the-art performance on key workloads while shortening development cycles. The result is a framework that offers both flexible algorithmic expression and high execution efficiency.

📝 Abstract

Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined dataflow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorization, and pipelining) from the dataflow and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's dataflow itself, while leaving most other optimizations to the compiler. We conduct comprehensive experiments on commonly used devices. Our evaluation shows that TileLang achieves state-of-the-art performance on key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.

Problem

Research questions and friction points this paper is trying to address.

Simplify writing high-performance AI kernels

Decouple scheduling from dataflow for flexibility

Achieve state-of-the-art performance with ease

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples scheduling space from dataflow

Uses customization annotations and primitives

Unified block-and-thread paradigm
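
To make the idea of separating dataflow from scheduling concrete, here is a minimal pure-Python sketch of a tiled matrix multiply. It is not TileLang code and does not use TileLang's API: the `Schedule` class, `load_tile`, and `tiled_matmul` are hypothetical names introduced only for illustration, and the block sizes stand in for the richer scheduling space (thread binding, layout, tensorization, pipelining) that the paper describes. The point it tries to show is that the dataflow (load tiles, accumulate, write back) is written once, while the schedule is passed in separately and can change without altering the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Hypothetical scheduling knobs (illustration only, not TileLang's API)."""
    block_m: int
    block_n: int
    block_k: int


def load_tile(src, row0, col0, rows, cols):
    """Copy a tile out of `src`, clipped at the matrix edge.

    Stands in for a DRAM -> SRAM copy in a real kernel.
    """
    row_end = min(row0 + rows, len(src))
    col_end = min(col0 + cols, len(src[0]))
    return [[src[r][c] for c in range(col0, col_end)]
            for r in range(row0, row_end)]


def tiled_matmul(a, b, sched):
    """Dataflow: load A and B tiles, accumulate locally, write the C tile back."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, sched.block_m):
        for j0 in range(0, n, sched.block_n):
            rows = min(sched.block_m, m - i0)
            cols = min(sched.block_n, n - j0)
            # Local accumulator for one output tile.
            acc = [[0] * cols for _ in range(rows)]
            for k0 in range(0, k, sched.block_k):
                a_tile = load_tile(a, i0, k0, sched.block_m, sched.block_k)
                b_tile = load_tile(b, k0, j0, sched.block_k, sched.block_n)
                for i in range(rows):
                    for j in range(cols):
                        for kk in range(len(b_tile)):
                            acc[i][j] += a_tile[i][kk] * b_tile[kk][j]
            # Write the finished tile back to the output matrix.
            for i in range(rows):
                for j in range(cols):
                    c[i0 + i][j0 + j] = acc[i][j]
    return c


def reference_matmul(a, b):
    """Untiled reference used to check the tiled version."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

Calling `tiled_matmul(a, b, Schedule(1, 1, 1))` and `tiled_matmul(a, b, Schedule(2, 2, 2))` on the same inputs should give identical results, which is the property a decoupled schedule is meant to preserve. In a real system the schedule would affect performance rather than just loop bounds.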
