AI Summary
To address load imbalance in MoE model training on heterogeneous GPU clusters (e.g., A40+V100), caused by the divergent hardware affinity of attention and expert modules, this paper proposes a component-decoupled scheduling framework. It introduces a novel "zebra-style" pipelined parallelism that interleaves computation phases across GPU generations; designs a heterogeneity-aware asymmetric expert placement strategy that dynamically assigns GPU roles based on empirically measured per-module performance across devices; and achieves, for the first time, fine-grained, component-level heterogeneous scheduling. Experiments demonstrate up to a 2.3× speedup over state-of-the-art MoE training systems. On a cluster where 50% of the GPUs are V100s, the framework sustains 95% of peak throughput while significantly reducing GPU idle time.
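To make the asymmetric expert placement idea concrete, here is a minimal sketch that distributes experts across GPUs in proportion to each device's measured expert throughput, so older GPUs receive proportionally fewer experts. The function name, signature, and throughput numbers are illustrative assumptions, not HeterMoE's actual interface.

```python
# Hypothetical sketch of heterogeneity-aware expert placement: experts are
# assigned in proportion to each GPU's measured expert-FFN throughput, so
# slower (older) GPUs receive fewer experts. All names and numbers below
# are illustrative, not HeterMoE's real API.

def place_experts(num_experts: int, expert_tflops: dict[str, float]) -> dict[str, int]:
    """Assign an expert count to each GPU, proportional to measured throughput."""
    total = sum(expert_tflops.values())
    # Provisional proportional shares, rounded down.
    shares = {gpu: int(num_experts * t / total) for gpu, t in expert_tflops.items()}
    # Hand out the leftover experts to the GPUs with the largest fractional shares.
    remainder = num_experts - sum(shares.values())
    by_frac = sorted(expert_tflops,
                     key=lambda g: (num_experts * expert_tflops[g] / total) % 1,
                     reverse=True)
    for gpu in by_frac[:remainder]:
        shares[gpu] += 1
    return shares

# Example with made-up measurements: A40s ~1.6x faster than V100s on expert FFNs.
print(place_experts(16, {"A40-0": 16.0, "A40-1": 16.0, "V100-0": 10.0, "V100-1": 10.0}))
# -> {'A40-0': 5, 'A40-1': 5, 'V100-0': 3, 'V100-1': 3}
```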
Abstract
The Mixture-of-Experts (MoE) architecture has become increasingly popular as a method to scale up large language models (LLMs). To save costs, heterogeneity-aware training solutions have been proposed that utilize GPU clusters made up of both newer and older-generation GPUs. However, existing solutions are agnostic to the performance characteristics of different MoE model components (i.e., attention and expert) and do not fully utilize each GPU's compute capability. In this paper, we introduce HeterMoE, a system to efficiently train MoE models on heterogeneous GPUs. Our key insight is that newer GPUs significantly outperform older generations on attention due to architectural advancements, while older GPUs remain relatively efficient for experts. HeterMoE disaggregates attention and expert computation, assigning older GPUs only expert modules. Through the proposed zebra parallelism, HeterMoE overlaps the computation on different GPUs, in addition to employing an asymmetric expert assignment strategy for fine-grained load balancing to minimize GPU idle time. Our evaluation shows that HeterMoE achieves up to a 2.3x speedup compared to existing MoE training systems, and 1.4x compared to an optimally balanced heterogeneity-aware solution. HeterMoE efficiently utilizes older GPUs, maintaining 95% of training throughput on average even with half of the GPUs in a homogeneous A40 cluster replaced with V100s.
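As a rough illustration of how the zebra-style overlap keeps both GPU generations busy, the toy timeline below staggers micro-batches so the attention GPU works on micro-batch i+1 while the expert GPU processes micro-batch i. This is a hypothetical forward-pass-only sketch, not HeterMoE's scheduler.

```python
# Toy "zebra" timeline: the attention GPU (newer, e.g. A40) and the expert
# GPU (older, e.g. V100) process adjacent micro-batches in an interleaved
# pipeline, so neither device sits idle between phases. Illustrative only.

def zebra_schedule(num_microbatches: int) -> list[tuple[str, str]]:
    """Return per-step (attention-GPU task, expert-GPU task) pairs."""
    steps = []
    for t in range(num_microbatches + 1):
        attn = f"attn(mb{t})" if t < num_microbatches else "idle"
        expert = f"expert(mb{t - 1})" if t > 0 else "idle"
        steps.append((attn, expert))
    return steps

for attn, expert in zebra_schedule(4):
    print(f"attention GPU: {attn:10} | expert GPU: {expert}")
# attention GPU: attn(mb0)  | expert GPU: idle
# attention GPU: attn(mb1)  | expert GPU: expert(mb0)
# attention GPU: attn(mb2)  | expert GPU: expert(mb1)
# attention GPU: attn(mb3)  | expert GPU: expert(mb2)
# attention GPU: idle       | expert GPU: expert(mb3)
```

After the single fill step at the start and drain step at the end, both devices stay fully occupied, which is the source of the reduced idle time the paper reports.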