TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: In distributed training of large language models, intra-layer parallel operators overlap computation and communication poorly, and optimizing them by hand is complex and error-prone. Method: This paper introduces *tile-centric primitives*, a co-designed abstraction for computation and communication that decouples high-level semantic specification from low-level instruction fusion. A compiler with a frontend (primitive modeling) and a backend (low-level communication instruction generation) uses these primitives to generate overlap-optimized GPU kernels automatically. Contribution/Results: TileLink achieves speedups of 1.17× to 20.76× over non-overlapping baselines and performance comparable to state-of-the-art overlapping libraries, combining automated kernel generation with high execution efficiency for intra-layer parallelism.

📝 Abstract
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone. In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of a frontend and a backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves from $1.17\times$ to $20.76\times$ speedup over non-overlapping baselines and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
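To make the operator-decomposition option concrete, here is a minimal CPU-only sketch in plain Python. It is not TileLink code and not a real collective: `fetch_shard` is a hypothetical stand-in for receiving one peer's shard, and a background thread plays the role of the communication stream. The gathered input is split per shard so that the product for shard `i` is computed while shard `i + 1` is still being fetched.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fetch_shard(shards, rank):
    """Hypothetical stand-in for communication: copy one peer's shard locally."""
    return [row[:] for row in shards[rank]]


def allgather_gemm_baseline(shards, weight):
    """Non-overlapping version: gather every shard first, then multiply once."""
    gathered = [row for rank in range(len(shards)) for row in fetch_shard(shards, rank)]
    return matmul(gathered, weight)


def allgather_gemm_decomposed(shards, weight):
    """Decomposed version: multiply shard i while shard i + 1 is being fetched."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_shard, shards, 0)
        for rank in range(len(shards)):
            current = pending.result()  # block until this shard has arrived
            if rank + 1 < len(shards):
                # Start fetching the next shard before computing on this one.
                pending = comm.submit(fetch_shard, shards, rank + 1)
            out.extend(matmul(current, weight))
    return out
```

Both functions return the same rows in the same order; only the schedule differs. The decomposed version overlaps at the granularity of whole shards, which is one plausible reading of why the abstract calls this approach straightforward but often suboptimal compared with fusing communication into the compute kernel.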
Problem

Research questions and friction points this paper is trying to address.

Enhancing performance of intra-layer parallel operators
Overlapping computation with communication efficiently
Simplifying kernel fusion for distributed model execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

TileLink uses tile-centric primitives for decoupling
Frontend-backend design enables efficient kernel generation
Achieves computation-communication overlap for speedup
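
The idea of linking communication and computation at tile granularity can be sketched on the CPU as follows. This is an illustration under stated assumptions, not TileLink's interface: `TileChannel`, `notify`, and `wait` are hypothetical names, and Python threads with `threading.Event` stand in for GPU-side signals. The communication side publishes each tile as soon as it arrives, and the compute side blocks only on the specific tile it needs next.

```python
import threading


class TileChannel:
    """Hypothetical per-tile mailbox: one ready flag and one buffer slot per tile."""

    def __init__(self, num_tiles):
        self.ready = [threading.Event() for _ in range(num_tiles)]
        self.buffers = [None] * num_tiles

    def notify(self, tile_id, data):
        """Communication side: store the tile, then mark it ready."""
        self.buffers[tile_id] = data
        self.ready[tile_id].set()

    def wait(self, tile_id):
        """Compute side: block until this tile is ready, then return it."""
        self.ready[tile_id].wait()
        return self.buffers[tile_id]


def communicate(channel, remote_tiles):
    """Stand-in for data transfer: deliver tiles one at a time."""
    for tile_id, tile in enumerate(remote_tiles):
        channel.notify(tile_id, list(tile))


def compute(channel, num_tiles, scale):
    """Consume tiles in order; each tile is processed as soon as it is ready."""
    results = []
    for tile_id in range(num_tiles):
        tile = channel.wait(tile_id)
        results.append([scale * x for x in tile])
    return results


def run_overlapped(remote_tiles, scale):
    """Run communication in a background thread while computing in the caller."""
    channel = TileChannel(len(remote_tiles))
    sender = threading.Thread(target=communicate, args=(channel, remote_tiles))
    sender.start()
    results = compute(channel, len(remote_tiles), scale)
    sender.join()
    return results
```

Because the two sides share only the per-tile ready flags, the delivery order and the compute logic can be changed independently, which is the kind of decoupling the bullets above describe.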