TL: Automatic End-to-End Compiler of Tile-Based Languages for Spatial Dataflow Architectures

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Spatial dataflow architectures suffer from low programmability and heavy reliance on vendor-provided, hand-optimized libraries, which keeps them from realizing their potential for high performance and energy efficiency. Method: This paper proposes TL, an end-to-end compilation framework built atop MLIR. TL introduces a hardware-aware, unified intermediate representation (IR) that models topology, storage, and computation across architectures. It goes beyond single-tile optimization by jointly optimizing global tile distribution, network-on-chip (NoC) communication, and distributed memory reuse. Contribution/Results: By integrating hardware topology modeling, multi-level memory hierarchy analysis, and NoC-aware scheduling, TL aims to improve data reuse and reduce communication overhead on heterogeneous spatial accelerators. It automatically maps high-level tiled programs (e.g., Triton kernels) to these targets without depending on vendor-specific libraries.

📝 Abstract
Spatial dataflow accelerators are a promising direction for next-generation computer systems because they can reduce the memory bottlenecks of traditional von Neumann machines such as CPUs and GPUs. They do so by organizing computation around explicit, compiler-managed data movement over the on-chip network, which allows operands to be forwarded directly between processing elements and reduces reliance on high-latency, bandwidth-limited global shared memory. Such localized communication can provide higher throughput and efficiency than repeated off-chip memory accesses. However, end-to-end performance depends strongly on how workloads are mapped to the hardware. Naive mappings can perform very poorly, and most users rely on hand-tuned vendor libraries. In practice, although existing spatial dataflow accelerators have strong potential for high performance, energy efficiency, and cost efficiency, their limited programmability remains a major barrier to wider adoption. This paper presents TL, an end-to-end framework that compiles tile-based programs (such as Triton kernels) onto spatial dataflow architectures. Unlike most existing compiler frameworks, which focus on optimizing code generation within a single tile, TL addresses the central challenge of distributing tile instances across spatially distributed cores and exploiting the on-chip network and distributed memories to increase data reuse and reduce communication. TL proposes a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities, enabling both architecture-specific optimizations and support for diverse spatial dataflow targets. TL is built on the MLIR ecosystem and defines a generic entry point for different front-ends and an end point for different back-ends.
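To make the data-reuse argument in the abstract concrete, here is a minimal sketch of a traffic count for a tiled matrix multiplication on a 2D mesh of cores. It is not TL's algorithm or cost model. It assumes one output tile per core, that all cores in a mesh row share the same A tile and all cores in a mesh column share the same B tile, and that a shared tile is read once from global memory and then forwarded core to core. The names `Traffic`, `naive_traffic`, and `forwarded_traffic` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traffic:
    global_loads: int  # tile-sized reads from global (off-chip) memory
    noc_hops: int      # tile-sized forwards between neighbouring cores


def naive_traffic(rows: int, cols: int, k_steps: int) -> Traffic:
    """Every core fetches its own A tile and B tile at every K step."""
    return Traffic(global_loads=2 * rows * cols * k_steps, noc_hops=0)


def forwarded_traffic(rows: int, cols: int, k_steps: int) -> Traffic:
    """Each A tile is loaded once per mesh row and passed along the row;
    each B tile is loaded once per mesh column and passed down the column."""
    a_loads = rows * k_steps
    b_loads = cols * k_steps
    a_hops = rows * (cols - 1) * k_steps  # row-wise forwarding of A tiles
    b_hops = cols * (rows - 1) * k_steps  # column-wise forwarding of B tiles
    return Traffic(global_loads=a_loads + b_loads, noc_hops=a_hops + b_hops)


if __name__ == "__main__":
    print(naive_traffic(4, 4, 8))
    print(forwarded_traffic(4, 4, 8))
```

Under these assumptions, a 4x4 mesh with 8 K steps drops from 256 global tile loads to 64, at the cost of 192 neighbour-to-neighbour forwards. Whether that trade pays off depends on the relative cost of a global load and a NoC hop on the target hardware.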
Problem

Research questions and friction points this paper is trying to address.

Compiles tile-based programs for spatial dataflow accelerators
Optimizes tile distribution across cores to enhance data reuse
Provides a hardware-aware framework supporting diverse accelerator targets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic end-to-end compiler for tile-based languages
Hardware representation capturing interconnect and memory hierarchy
Built on the MLIR ecosystem, with a generic entry point for front-ends and an end point for back-ends
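As an illustration of what a hardware representation covering interconnect topology, memory hierarchy, and compute capabilities might hold, here is a small sketch. It is not TL's actual representation, which is defined in MLIR and not shown on this page. The class names, fields, and the choice of a 2D mesh are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLevel:
    name: str
    capacity_bytes: int
    shared: bool  # True if visible to all cores, False if core-private


@dataclass(frozen=True)
class MeshHardware:
    rows: int                 # interconnect topology: a rows x cols mesh
    cols: int
    memories: tuple           # MemoryLevel entries, closest to compute first
    compute_ops: frozenset    # operation names a core can execute

    def neighbors(self, r: int, c: int) -> list:
        """Cores directly linked to core (r, c) in the mesh."""
        candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return sorted(
            (i, j) for i, j in candidates
            if 0 <= i < self.rows and 0 <= j < self.cols
        )

    def innermost_level_fitting(self, nbytes: int):
        """Name of the closest memory level that can hold nbytes, else None."""
        for level in self.memories:
            if nbytes <= level.capacity_bytes:
                return level.name
        return None


if __name__ == "__main__":
    hw = MeshHardware(
        rows=2,
        cols=3,
        memories=(
            MemoryLevel("local_sram", 64 * 1024, shared=False),
            MemoryLevel("global_dram", 8 * 2**30, shared=True),
        ),
        compute_ops=frozenset({"matmul", "add"}),
    )
    print(hw.neighbors(0, 0))
    print(hw.innermost_level_fitting(32 * 1024))
```

A compiler pass could query a description like this to decide which cores can forward tiles to each other and whether a tile fits in core-local memory, without hard-coding any one accelerator.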
🔎 Similar Papers
2024-04-02 · International Conference on Architectural Support for Programming Languages and Operating Systems · Citations: 4
Wei Li
School of Computing, National University of Singapore
Zhenyu Bai
School of Computing, National University of Singapore
Heru Wang
School of Computing, National University of Singapore
Pranav Dangi
National University of Singapore
Computer Architecture, Compilers, Reconfigurable Computing
Zhiqiang Zhang
School of Computing, National University of Singapore
Cheng Tan
Arizona State University and Google
Huiying Lan
Lumai Ltd.
Weng-Fai Wong
Associate Professor of Computer Science, National University of Singapore
Computer architecture, compilers, high performance computing, embedded systems, parallel processing
Tulika Mitra
Professor of Computer Science, National University of Singapore
Design Automation, Low Power Design, Embedded Systems, Real-Time Systems