FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion

πŸ“… 2025-12-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In MoE models, expert parallelism relies on distributed data redistribution (e.g., All-to-All), but existing communication libraries (e.g., NCCL) incur high overheadβ€”often exceeding 50% of end-to-end latency. This work proposes a lightweight, transformation-aware redistribution scheme that tightly integrates data transformation with communication. Our key contributions are: (1) the first transformation-communication fusion paradigm; (2) layout-aware, fine-grained scheduling and a pipelined All-to-All variant; and (3) runtime topology-adaptive dynamic planning with zero-copy routing. Experiments under typical MoE configurations show that our approach achieves up to 3.84Γ— and 2.01Γ— higher redistribution throughput than NCCL and DeepEP, respectively; reduces training latency by 1.17–1.39Γ— and 1.10–1.19Γ—; and improves first-token generation latency by 1.09–1.25Γ— and 1.06–1.16Γ—.

πŸ“ Abstract
Large-scale Mixture-of-Experts (MoE) models rely on expert parallelism for efficient training and inference, which splits experts across devices and necessitates distributed data shuffling to route each token to its assigned experts. However, existing communication libraries handle this shuffling poorly; its overhead can account for over half of end-to-end runtime. We present FUSCO, an MoE-friendly communication library that achieves efficient and lightweight data shuffling through fused data transformation and communication, based on the key observation that MoE's expert-major data layout conflicts with the device-major layout expected by communication operations. FUSCO captures the fine-grained data layout, which is then interpreted by a pipelined communication engine that performs the required shuffling efficiently along the communication path. Lightweight planning and load-balancing mechanisms complement the engine by eliminating redundant communication and dispersing traffic. Evaluations on representative benchmarks show that FUSCO achieves up to 3.84× and 2.01× speedups over NCCL and DeepEP (the state-of-the-art MoE communication library), respectively. In end-to-end MoE tasks, compared to NCCL and DeepEP, FUSCO reduces training latency by 1.17–1.39× and 1.10–1.19×, and lowers first-token generation latency in inference by 1.09–1.25× and 1.06–1.16×.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient distributed data shuffling in MoE expert parallelism
Reduces communication overhead that can dominate end-to-end runtime in MoE training
Resolves the conflict between MoE's expert-major data layout and the device-major layout expected by communication operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses data transformation and communication for efficient shuffling
Uses pipelined engine interpreting fine-grained data layouts
Employs lightweight planning and load-balancing to reduce overhead
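To make the layout conflict concrete, the following is a minimal sketch (hypothetical, not FUSCO's actual API; the round-robin expert placement is an assumption) of why MoE dispatch must transform its data layout before an All-to-All:

```python
# Minimal sketch (hypothetical, not FUSCO's API) of why MoE dispatch
# must regroup data before All-to-All communication.
# Assumption: 4 experts placed round-robin on 2 devices (expert e on device e % 2).
NUM_DEVICES = 2

# Router output in expert-major order: tokens grouped by assigned expert id.
tokens = [("t0", 0), ("t1", 0), ("t2", 1), ("t3", 2), ("t4", 3), ("t5", 3)]

# All-to-All expects device-major order: one contiguous send buffer per
# destination device, so tokens must be regrouped by the device hosting
# their expert before the collective can run.
send_buffers = {d: [] for d in range(NUM_DEVICES)}
for tok, expert in tokens:
    send_buffers[expert % NUM_DEVICES].append(tok)

print(send_buffers)
# -> {0: ['t0', 't1', 't3'], 1: ['t2', 't4', 't5']}
```

A conventional pipeline would run this regrouping as a separate pass (an extra copy) before invoking the communication library; per the abstract, FUSCO's contribution is fusing this transformation into the communication path itself.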
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Zhuoran Zhu (Tsinghua University)
Chunyang Zhu (Infinigence AI)
Hao Lin (Infinigence AI)
Xu Fu (Infinigence AI)
Yiming Zhou (Meta | UCLA)
Quanlu Zhang (Infinigence AI)
Zhenhua Li (Tsinghua University)
Feng Qian (University of Southern California)
Chao Yu (Zhongguancun Academy)
Boxun Li (Infinigence AI)
Guohao Dai (Associate Professor, Shanghai Jiao Tong University; Sparse Computation, Large-scale Graph Processing, FPGA, Circuits and Systems)
Yu Wang (Tsinghua University)