AutoOverlap: Enabling Fine-Grained Overlap of Computation and Communication with Chunk-Based Scheduling

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the growing communication bottleneck in large-scale GPU workloads, where existing distributed compilers support only coarse-grained, stream-level computation-communication overlap, leading to excessive kernel launches, device synchronizations, and communication tail latency. To overcome these limitations, the authors introduce a communication chunk abstraction and a chunk-level scheduling mechanism that enables fine-grained, automatic computation-communication overlap within a single fused kernel—decoupling communication granularity from kernel structure for the first time. The approach is compatible with compiler-based migration, user-written code, and template-instantiated schedules, and is implemented as a source-to-source transformation based on Triton. Experimental results demonstrate end-to-end speedups of up to 4.7× and an average of 1.3× across multi-GPU workloads.

📝 Abstract
Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronizations at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present AutoOverlap, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. AutoOverlap introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, AutoOverlap performs transformations to align computation with chunk availability. Implemented as a source-to-source compiler on Triton, AutoOverlap delivers an average end-to-end speedup of 1.3× and up to 4.7× on multi-GPU workloads.
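The intuition behind chunk-level overlap can be illustrated with a small simulation. The sketch below is hypothetical and uses no AutoOverlap API: all names (`chunk_schedule`, `simulate_overlap`) are invented for illustration. It models the core idea from the abstract: splitting a communication buffer into chunks lets a compute tile start as soon as its chunk arrives, rather than waiting for the entire transfer at a kernel boundary.

```python
# Hypothetical sketch (not AutoOverlap's actual API): compare the total
# time of coarse, kernel-level overlap against chunk-level overlap.

def chunk_schedule(num_tiles, num_chunks):
    """Map each compute tile to the communication chunk it depends on."""
    tiles_per_chunk = (num_tiles + num_chunks - 1) // num_chunks
    return {t: t // tiles_per_chunk for t in range(num_tiles)}

def simulate_overlap(num_tiles, num_chunks, comm_time, compute_time):
    """Return (coarse_time, chunked_time) under a simple cost model.

    Coarse: all communication finishes before any tile computes
    (a device-wide sync at the kernel boundary).
    Chunk-level: tiles depending on chunk i start once chunk i arrives.
    Compute is serialized on a single stream in this toy model.
    """
    dep = chunk_schedule(num_tiles, num_chunks)
    # Chunk c finishes transferring at a proportional point in time.
    chunk_arrival = {c: (c + 1) * comm_time / num_chunks
                     for c in range(num_chunks)}

    coarse = comm_time + num_tiles * compute_time

    t = 0.0
    for tile in range(num_tiles):
        ready = chunk_arrival[dep[tile]]   # wait for this tile's chunk
        t = max(t, ready) + compute_time   # then run the tile
    return coarse, t

coarse, chunked = simulate_overlap(8, 4, comm_time=8.0, compute_time=1.0)
print(coarse, chunked)  # chunk-level overlap hides most of the transfer
```

With eight tiles over four chunks, the chunked schedule overlaps compute with in-flight communication and finishes well ahead of the coarse schedule; the gap widens as communication grows relative to per-tile compute, which mirrors the tail-latency argument in the abstract.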
Problem

Research questions and friction points this paper is trying to address.

communication-computation overlap
fine-grained scheduling
GPU communication bottleneck
kernel fusion
distributed compilation
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained overlap
communication chunk
source-to-source compilation
kernel fusion
distributed GPU computing