Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

πŸ“… 2025-11-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address severe performance overheads in distributed large language model (LLM) workloads arising from bulk synchronous parallel (BSP) execution across multiple GPUs, this paper proposes a desynchronized systems approach. We introduce the "Three Taxes" analytical framework to identify and eliminate fine-grained overheads across the inter-GPU communication, synchronization, and kernel-scheduling layers. Leveraging the Iris library for Triton, we implement in-kernel communication primitives and design a tile-grained dataflow synchronization pipeline that replaces global barriers. We apply this methodology to kernels ranging from All-Gather fused with matrix multiplication to the Flash Decode algorithm. Experimental results demonstrate a 10–20% end-to-end latency reduction over BSP baselines, improving both execution efficiency and programming flexibility for distributed LLM inference and training.

πŸ“ Abstract
As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel (BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication operation to the complex Flash Decode algorithm, we observe a 10–20% speedup in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
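The core idea of the abstract, replacing a global barrier with per-tile dataflow synchronization, can be sketched in plain Python. This is an illustrative model only (the paper's actual implementation uses in-kernel GPU primitives via Iris for Triton); here, threads stand in for GPU ranks, a `threading.Barrier` stands in for the BSP-style global barrier, and per-tile `Event` flags stand in for fine-grained signals. All function and variable names are hypothetical.

```python
# Sketch: BSP-style global barrier vs. tile-grained dataflow handoff.
# Threads model producer/consumer GPU ranks; NOT the paper's real kernels.
import threading

NUM_TILES = 8

def bsp_style(produce, consume):
    """BSP model: the consumer waits at a global barrier until ALL tiles
    are produced before it may start consuming any of them."""
    barrier = threading.Barrier(2)
    tiles = [None] * NUM_TILES
    out = []

    def producer():
        for t in range(NUM_TILES):
            tiles[t] = produce(t)
        barrier.wait()  # global barrier: releases everything at once

    def consumer():
        barrier.wait()  # blocked until the entire batch is ready
        for t in range(NUM_TILES):
            out.append(consume(tiles[t]))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads: th.start()
    for th in threads: th.join()
    return out

def tile_dataflow(produce, consume):
    """Dataflow model: one flag per tile, so each tile is consumed as soon
    as it is produced, overlapping the two stages."""
    flags = [threading.Event() for _ in range(NUM_TILES)]
    tiles = [None] * NUM_TILES
    out = []

    def producer():
        for t in range(NUM_TILES):
            tiles[t] = produce(t)
            flags[t].set()  # signal this tile only; no global barrier

    def consumer():
        for t in range(NUM_TILES):
            flags[t].wait()  # wait for tile t alone
            out.append(consume(tiles[t]))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads: th.start()
    for th in threads: th.join()
    return out
```

Both variants compute the same result; the dataflow version simply removes the all-or-nothing wait, which is the "bulk synchronous tax" in the paper's terminology.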
Problem

Research questions and friction points this paper is trying to address.

Eliminating performance inefficiencies in multi-GPU distributed LLM execution
Addressing synchronization and communication bottlenecks in distributed GPU systems
Improving computational efficiency through fine-grained dataflow synchronization patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained programming patterns replace BSP model
Tile-level producer-consumer pipelines eliminate synchronization taxes
In-kernel communication enables dataflow synchronization
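The All-Gather + GEMM co-optimization mentioned above can be illustrated with a toy numeric sketch: rather than gathering every shard of the input before running one large matrix multiply, each shard's partial product is computed as the shard "arrives" and accumulated. This is a hypothetical pure-Python model of the dataflow idea, not the paper's GPU implementation; shard layout and function names are assumptions.

```python
# Sketch: All-Gather followed by GEMM vs. per-shard overlapped accumulation.
# A is sharded along the K (reduction) dimension, so C = sum_r A_r @ B_r.

def matmul(a, b):
    """Naive dense matrix multiply on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(x, y):
    """Elementwise matrix addition."""
    return [[xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]

def allgather_then_gemm(a_shards, b_shards):
    """BSP baseline: reassemble the full A and B, then run one GEMM.
    Nothing is computed until the gather completes."""
    a_full = [sum((shard[i] for shard in a_shards), [])
              for i in range(len(a_shards[0]))]
    b_full = [row for shard in b_shards for row in shard]
    return matmul(a_full, b_full)

def allgather_gemm_overlapped(a_shards, b_shards):
    """Overlapped version: multiply each shard as soon as it is available
    and accumulate its partial product, hiding gather latency."""
    acc = None
    for a_r, b_r in zip(a_shards, b_shards):  # models arrival order
        partial = matmul(a_r, b_r)
        acc = partial if acc is None else add(acc, partial)
    return acc
```

The two paths are numerically identical because matrix multiplication distributes over a partition of the reduction dimension; the overlap only changes *when* each partial product is computed.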
Octavian Alexandru Trifan
University of California, Irvine
Karthik Sangaiah
AMD Research and Advanced Development
Muhammad A. Awad
AMD Research and Advanced Development
Muhammad Osama
AMD Research and Advanced Development
Sumanth Gudaparthi
AMD Research and Advanced Development
Alexandru Nicolau
University of California, Irvine
A. Veidenbaum
University of California, Irvine
Ganesh Dasika
Unknown affiliation