Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

📅 2024-04-07
🏛️ arXiv.org
📈 Citations: 8
Influential: 1
🤖 AI Summary
Expert parallelism in sparsely gated Mixture-of-Experts (MoE) models suffers from a severe All-to-All communication bottleneck, exacerbated by the sequential dependency between communication and computation in existing approaches. This work proposes ScMoE, a shortcut-connected MoE architecture paired with an overlapping parallelization strategy that decouples communication from its conventional sequential ordering, enabling up to 100% overlap between communication and computation in expert parallelism. ScMoE matches, and in some instances surpasses, the model quality of the prevalent top-2 MoE baseline while accelerating training by 1.49× and inference by 1.82×, addressing a fundamental scalability limitation of large-scale sparse models.

📝 Abstract
Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. Although existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49 times in training and 1.82 times in inference. Moreover, our experiments and analyses indicate that ScMoE not only achieves model quality comparable to existing approaches but in some instances surpasses it.
Problem

Research questions and friction points this paper is trying to address.

All-to-All communication bottleneck in expert parallelism
Sequential dependency between communication and computation
Efficiency limitations in Mixture-of-Experts models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shortcut-connected MoE architecture
Overlapping parallelization strategy
Decouples communication from computation
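The core scheduling idea above — launching the All-to-All asynchronously so that shortcut-branch computation proceeds while tokens are in flight — can be sketched as follows. This is a minimal illustration using Python threads and simulated latencies, not the paper's implementation; the function names, timings, and the use of a thread pool in place of real device communication are all hypothetical.

```python
# Illustrative sketch only: overlapping communication with computation,
# simulated with Python threads. Timings and names are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

COMM_TIME = 0.05     # simulated All-to-All dispatch latency
COMPUTE_TIME = 0.05  # simulated shortcut-branch computation time

def all_to_all(tokens):
    """Stand-in for the expert-parallel All-to-All exchange."""
    time.sleep(COMM_TIME)
    return tokens  # in practice: tokens routed to remote experts

def shortcut_branch(tokens):
    """Stand-in for computation on the shortcut connection."""
    time.sleep(COMPUTE_TIME)
    return [t * 2 for t in tokens]

def sequential_moe(tokens):
    # Conventional ordering: communicate first, then compute.
    routed = all_to_all(tokens)
    return shortcut_branch(routed)

def overlapped_moe(tokens):
    # ScMoE-style scheduling: start the All-to-All asynchronously and
    # run the shortcut computation while communication is in flight.
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(all_to_all, tokens)
        local = shortcut_branch(tokens)  # overlaps with communication
        routed = comm.result()
    return local, routed

start = time.perf_counter()
sequential_moe([1, 2, 3])
t_seq = time.perf_counter() - start

start = time.perf_counter()
local, routed = overlapped_moe([1, 2, 3])
t_ovl = time.perf_counter() - start

print(f"sequential: {t_seq:.3f}s, overlapped: {t_ovl:.3f}s")
```

With equal simulated communication and computation times, the overlapped schedule takes roughly half as long as the sequential one, mirroring the "up to 100% overlap" claim; in a real system the overlap would be expressed with asynchronous collectives (e.g., non-blocking All-to-All) rather than threads.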