🤖 AI Summary
In distributed Mixture-of-Experts (MoE) training, computation and communication are traditionally scheduled separately, neglecting holistic optimization across key operations such as multi-head attention (MHA), gating, and all-reduce. This paper introduces FlowMoE, the first framework to unify the scheduling of these heterogeneous tasks: MHA computing, gating, expert computing, and all-to-all/all-reduce communication. FlowMoE combines a unified task pipeline with a tensor chunk-based priority scheduler that tightly integrates pipeline parallelism with computation-communication overlap. Implemented as an adaptive, general-purpose library atop PyTorch, it supports diverse MoE architectures and hardware configurations. Extensive experiments demonstrate consistent improvements: 13%-57% lower training time, 10%-39% lower energy consumption, and 7%-32% smaller memory footprint, significantly enhancing both training efficiency and resource utilization in distributed MoE systems.
📝 Abstract
The parameter size of modern large language models (LLMs) can be scaled up via the sparsely-activated Mixture-of-Experts (MoE) technique to avoid an excessive increase in computational cost. To further improve training efficiency, pipelining computation and communication has become a promising solution for distributed MoE training. However, existing work primarily focuses on scheduling tasks within the MoE layer, such as expert computing and all-to-all (A2A) communication, while neglecting other key operations including multi-head attention (MHA) computing, gating, and all-reduce communication. In this paper, we propose FlowMoE, a scalable framework for scheduling multi-type task pipelines. First, FlowMoE constructs a unified pipeline to consistently schedule MHA computing, gating, expert computing, and A2A communication. Second, FlowMoE introduces a tensor chunk-based priority scheduling mechanism to overlap all-reduce communication with all computing tasks. We implement FlowMoE as an adaptive and generic framework atop PyTorch. Extensive experiments with 675 typical MoE layers and four real-world MoE models across two GPU clusters demonstrate that FlowMoE outperforms state-of-the-art MoE training frameworks, reducing training time by 13%-57%, energy consumption by 10%-39%, and memory usage by 7%-32%.
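To make the tensor chunk-based priority idea concrete, here is a minimal, self-contained sketch of how such a scheduler could be organized. This is an illustration under assumptions, not FlowMoE's actual API: all task names, the chunking helper, and the priority values are hypothetical, and real systems would issue asynchronous NCCL/CUDA operations rather than drain a Python heap. The key point it demonstrates is that gradients are split into chunks, and each all-reduce chunk is assigned a lower priority than pending compute tasks, so communication fills the gaps behind computation.

```python
import heapq

def make_chunks(grad, chunk_size):
    """Split a flat gradient (here a plain list) into fixed-size chunks.

    Hypothetical helper: real frameworks would chunk tensors, not lists.
    """
    return [grad[i:i + chunk_size] for i in range(0, len(grad), chunk_size)]

class PriorityScheduler:
    """Toy priority scheduler: lower number = higher priority."""

    def __init__(self):
        self._queue = []  # min-heap of (priority, seq, task_name)
        self._seq = 0     # tie-breaker preserving submission order

    def submit(self, priority, task_name):
        heapq.heappush(self._queue, (priority, self._seq, task_name))
        self._seq += 1

    def run(self):
        """Drain tasks in priority order; return the execution order."""
        order = []
        while self._queue:
            _, _, name = heapq.heappop(self._queue)
            order.append(name)
        return order

sched = PriorityScheduler()
# Compute and A2A tasks on the critical path get high priority;
# all-reduce chunks for earlier layers run behind them (priority 2).
sched.submit(0, "MHA_compute(layer1)")
sched.submit(2, "all_reduce(chunk0)")   # deferred: overlaps with compute
sched.submit(0, "expert_compute(layer1)")
sched.submit(1, "A2A(layer1)")

print(sched.run())
# → ['MHA_compute(layer1)', 'expert_compute(layer1)',
#    'A2A(layer1)', 'all_reduce(chunk0)']
```

Chunking the all-reduce (rather than reducing one large tensor) is what makes this overlap fine-grained: each small chunk can be slotted into whatever idle communication window the compute tasks leave open.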