FLASH: Fast All-to-All Communication in GPU Clusters

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
All-to-All communication in GPU clusters suffers from incast congestion, straggler effects, and high scheduling overhead due to heterogeneous interconnects (e.g., NVLink and Ethernet). Method: We propose the first lightweight scheduling framework that simultaneously achieves theoretical near-optimality and practical low overhead. It introduces a hierarchical network model to decouple intra- and inter-node communication, a polynomial-time algorithm to maximize bottleneck-link utilization, and a background GPU-to-GPU data pre-migration mechanism. Contribution/Results: We theoretically prove that, under high-speed intra-node networks, our approach asymptotically approaches the optimal completion time with negligible computational overhead. Experiments show that our method achieves All-to-All completion times comparable to exact solvers like TACCL, while reducing scheduling latency by 3–4 orders of magnitude—significantly outperforming existing heuristic and optimization-based schedulers.

📝 Abstract
Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers, and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities between technologies like NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretically optimal transfers. Classical, simple scheduling algorithms such as SpreadOut fail to minimize transfer completion times; modern optimization-based schedulers such as TACCL achieve better completion times, but with computation times that can be orders of magnitude longer than the transfer itself. This paper presents FLASH, which schedules near-optimal All-to-All transfers with a simple, polynomial-time algorithm. FLASH keeps the bottleneck inter-server network maximally utilized and, in the background, shuffles data between GPUs over fast intra-server networks to mitigate stragglers. We prove that, so long as intra-server networks are significantly faster than inter-server networks, FLASH approaches near-optimal transfer completion times. We implement FLASH and demonstrate that its computational overheads are negligible, yet it achieves transfer completion times comparable to state-of-the-art solver-based schedulers.
Problem

Research questions and friction points this paper is trying to address.

Efficiently schedule All-to-All communications in GPU clusters
Minimize transfer completion times despite heterogeneous link capacities
Reduce computational overhead compared to optimization-based schedulers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schedules with a simple, polynomial-time algorithm
Keeps the bottleneck inter-server network maximally utilized
Leverages fast intra-server networks to mitigate stragglers
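The core idea above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function names, the uniform per-server NIC bandwidth, and the equal-share rebalancing target are all assumptions made for the example. It shows (a) the bottleneck-link lower bound on completion time that FLASH-style scheduling aims to match, and (b) how intra-server pre-migration would equalize outbound bytes across a server's GPUs so no single NIC straggles.

```python
# Hypothetical sketch of the scheduling idea (illustrative names/parameters,
# not the paper's artifact): keep slow inter-server links saturated, and use
# fast intra-server links (e.g., NVLink) to rebalance load across GPU NICs.

def bottleneck_lower_bound(demand, inter_bw):
    """Completion-time lower bound for an All-to-All: the busiest server's
    total outbound (or inbound) inter-server traffic over its NIC bandwidth.

    demand[s][d] = bytes server s must send to server d; inter_bw = bytes/sec.
    """
    n = len(demand)
    out_bytes = [sum(demand[s][d] for d in range(n) if d != s) for s in range(n)]
    in_bytes = [sum(demand[s][d] for s in range(n) if s != d) for d in range(n)]
    return max(max(out_bytes), max(in_bytes)) / inter_bw

def rebalance_intra_server(gpu_out_bytes):
    """Pre-migration target within one server: equalize outbound bytes across
    its GPUs so no single GPU's NIC becomes a straggler.

    Returns per-GPU surplus: positive = offload to peers, negative = receive.
    """
    share = sum(gpu_out_bytes) / len(gpu_out_bytes)
    return [b - share for b in gpu_out_bytes]

demand = [[0, 40, 10], [40, 0, 40], [10, 40, 0]]      # bytes server -> server
print(bottleneck_lower_bound(demand, inter_bw=10.0))  # busiest server: 80 B -> 8.0
print(rebalance_intra_server([30, 10, 20, 20]))       # [10.0, -10.0, 0.0, 0.0]
```

With a skewed demand matrix like the one above, the lower bound is set entirely by the busiest server (80 bytes here), which is why keeping that link continuously utilized, while cheap intra-server shuffles absorb the imbalance, can approach the optimum without an expensive solver.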
Yiran Lei
Carnegie Mellon University
Dongjoo Lee
MangoBoost Inc.
Liangyu Zhao
University of Washington
Daniar Kurniawan
MangoBoost Inc.
Chanmyeong Kim
MangoBoost Inc.
Heetaek Jeong
MangoBoost Inc.
Changsu Kim
MangoBoost Inc.
Hyeonseong Choi
MangoBoost Inc.
Liangcheng Yu
Microsoft Research
Arvind Krishnamurthy
University of Washington
Justine Sherry
Carnegie Mellon University
Eriko Nurvitadhi
MangoBoost Inc.