DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the network bottleneck in expert-parallel training caused by cross-node all-to-all communication, which is particularly severe because attention and feed-forward network (FFN) layers have imbalanced compute-to-communication ratios. To mitigate this, the authors decouple attention and FFN layers onto distinct GPU groups, forming a unidirectional many-to-many communication pattern within a heterogeneous pipeline. Resource allocation is guided by a compute-communication roofline model and complemented by a bandwidth-aware load balancing strategy, so that attention and FFN computations are physically separated and efficiently overlapped. Implemented atop Megatron-LM and evaluated on a 16-node cluster with 8×H800 GPUs per node, the system alleviates the communication bottleneck in mixture-of-experts (MoE) training, yielding up to a 1.8× improvement in training efficiency.
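To make the imbalance argument concrete, here is a minimal sketch of why overlap alone can leave stalls. It assumes a simplified model in which each layer's communication is perfectly overlapped with that same layer's computation, so only the excess communication time is exposed. The function name `exposed_stall` and all timings are hypothetical illustrations, not values or code from the paper.

```python
def exposed_stall(compute_ms: float, comm_ms: float) -> float:
    """Communication time left visible when comm is overlapped with compute.

    Assumes perfect overlap within a layer (an idealization).
    """
    return max(0.0, comm_ms - compute_ms)


# Hypothetical per-layer timings in milliseconds: (compute, communication).
# Attention is compute-bound here, while the FFN layer is network-bound.
layers = {"attention": (4.0, 1.0), "ffn": (2.0, 5.0)}

stalls = {name: exposed_stall(c, m) for name, (c, m) in layers.items()}
total_stall = sum(stalls.values())

for name, stall in stalls.items():
    print(f"{name}: exposed stall = {stall} ms")
print(f"total exposed stall = {total_stall} ms")
```

Under these assumed numbers the attention layer has spare compute time that cannot absorb the FFN layer's excess communication, which is the kind of residual stall the disaggregated design is meant to address.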

📝 Abstract

Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work has focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls because the computation-to-communication ratios of attention and FFN layers are inherently imbalanced. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that it improves training efficiency across multiple MoE models, with up to 1.8x speedup on 16-node 8xH800 clusters.
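The roofline-guided allocation described above can be illustrated with a small sketch. This is not DisagMoE's actual model: it assumes each group's time per micro-batch is the larger of its compute time and its network time, that both scale linearly with the number of GPUs in the group, and that the pipeline is limited by the slower group. The names `Stage`, `stage_time`, and `best_split`, and every number used, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    flops: float      # floating-point work per micro-batch
    bytes_out: float  # bytes sent over the network per micro-batch


def stage_time(stage: Stage, gpus: int, peak_flops: float, nic_bw: float) -> float:
    """Roofline-style estimate: the stage is bound by compute or by network."""
    compute = stage.flops / (gpus * peak_flops)
    comm = stage.bytes_out / (gpus * nic_bw)
    return max(compute, comm)


def best_split(attn: Stage, ffn: Stage, total_gpus: int,
               peak_flops: float, nic_bw: float) -> tuple[int, float]:
    """Try every attention/FFN GPU split and keep the one whose slower
    group is fastest (the pipeline bottleneck under this assumption)."""
    best = None
    for attn_gpus in range(1, total_gpus):
        t = max(
            stage_time(attn, attn_gpus, peak_flops, nic_bw),
            stage_time(ffn, total_gpus - attn_gpus, peak_flops, nic_bw),
        )
        if best is None or t < best[1]:
            best = (attn_gpus, t)
    return best


# Hypothetical workload: attention is compute-bound, FFN is network-bound.
attn = Stage(flops=3e12, bytes_out=1e9)
ffn = Stage(flops=0.5e12, bytes_out=1e9)
attn_gpus, bottleneck = best_split(attn, ffn, total_gpus=8,
                                   peak_flops=1e12, nic_bw=1e9)
print(attn_gpus, bottleneck)
```

With these assumed inputs the search assigns more GPUs to the compute-heavy attention group until its time matches the network-bound FFN group. An exhaustive search is used only because the search space is tiny in this sketch.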

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

Expert Parallelism

All-to-all Communication

Communication Bottleneck

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated Parallelism

MoE Training

Computation-Communication Overlap