mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of designing hardware interconnects for Mixture-of-Experts (MoE) training, where expert communication is highly dynamic yet exhibits strong locality, this paper proposes mFabric, the first optical circuit switch (OCS)-enhanced interconnect architecture that supports real-time topology reconfiguration *during training*. The work makes three key contributions: (1) the first online topology reconfiguration mechanism for distributed MoE training; (2) a regionally reconfigurable high-bandwidth domain design that balances scalability with rapid adaptability; and (3) a production measurement study showing that MoE expert communication has strong locality, which makes regional rather than global reconfiguration sufficient. A fully functional prototype is validated on 32 A100 GPUs. At 100 Gbps and 400 Gbps link bandwidths, mFabric improves training cost efficiency by 1.2–1.5× and 1.9–2.3×, respectively, across four representative MoE models, matching the performance of a non-blocking fat-tree while significantly reducing hardware complexity and energy overhead.
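
As a concrete illustration of the locality idea in the summary above, here is a minimal Python sketch, not taken from the paper: it assumes hypothetical per-token gating outputs (the `src` ranks and top-k destination `dst` ranks are randomly generated toy data) and shows how one might build a GPU-to-GPU traffic matrix and measure what fraction of expert traffic stays inside a high-bandwidth domain.

```python
# Hypothetical sketch (not from the paper): estimating expert-communication
# locality from gating decisions. All names, shapes, and the toy data below
# are illustrative assumptions, not mFabric's actual interfaces.
import numpy as np

def traffic_matrix(src_rank_of_token, dst_ranks_of_token, num_gpus):
    """Count tokens sent from each source GPU to each expert-hosting GPU."""
    T = np.zeros((num_gpus, num_gpus), dtype=np.int64)
    for src, dsts in zip(src_rank_of_token, dst_ranks_of_token):
        for dst in dsts:                      # one entry per selected expert (top-k routing)
            T[src, dst] += 1
    return T

def intra_domain_fraction(T, domain_of_rank):
    """Fraction of all expert traffic that stays inside its high-bandwidth domain."""
    domains = np.asarray(domain_of_rank)
    same_domain = domains[:, None] == domains[None, :]
    return T[same_domain].sum() / max(T.sum(), 1)

# Toy example: 8 GPUs split into two 4-GPU domains, top-2 gating, 1024 tokens.
rng = np.random.default_rng(0)
src = rng.integers(0, 8, size=1024)          # each token's source GPU
dst = rng.integers(0, 8, size=(1024, 2))     # GPUs hosting the token's two selected experts
T = traffic_matrix(src, dst, num_gpus=8)
print("intra-domain fraction:", intra_domain_fraction(T, [r // 4 for r in range(8)]))
```

With uniformly random gating, as in this toy, roughly half the traffic stays intra-domain for two equal domains; the paper's measurement study reports much stronger locality in production MoE workloads, which is what motivates regional rather than global reconfiguration.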

📝 Abstract
Mixture-of-Expert (MoE) models outperform conventional models by selectively activating different subnets, named *experts*, on a per-token basis. This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain *static* during the distributed training process. In this paper, we advocate for a first-of-its-kind system, called mFabric, that unlocks topology reconfiguration *during* distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has *strong locality*, alleviating the requirement of global reconfiguration. Based on this, we design and implement a *regionally reconfigurable high-bandwidth domain* on top of existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We have built a fully functional mFabric prototype with commodity hardware and a customized collective communication runtime that trains state-of-the-art MoE models with *in-training* topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that mFabric delivers performance comparable to the non-blocking fat-tree fabric while boosting the training cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2×–1.5× and 1.9×–2.3× at 100 Gbps and 400 Gbps link bandwidths, respectively.
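
The *regionally reconfigurable high-bandwidth domain* implies a control loop that periodically re-programs inter-domain optical circuits to follow the measured traffic. The sketch below is only a hypothetical illustration of such a policy, a greedy pairing over a domain-to-domain traffic matrix that assumes one reconfigurable OCS uplink per domain; it is not the paper's actual reconfiguration algorithm, and every name in it is invented.

```python
# Hypothetical sketch (not mFabric's algorithm): pick inter-domain OCS circuits
# from a measured domain-to-domain traffic matrix, assuming each domain exposes
# a single reconfigurable uplink through the optical circuit switch.
import numpy as np

def plan_circuits(domain_traffic):
    """domain_traffic[i, j]: bytes sent from domain i to domain j since the last reconfiguration."""
    demand = domain_traffic + domain_traffic.T           # a circuit carries both directions
    n = demand.shape[0]
    pairs, used = [], set()
    # Greedily connect the hottest still-unmatched domain pairs.
    for i, j in sorted(((i, j) for i in range(n) for j in range(i + 1, n)),
                       key=lambda p: -demand[p]):
        if i not in used and j not in used and demand[i, j] > 0:
            pairs.append((i, j))
            used.update((i, j))
    return pairs

# Toy example: 4 domains; domains (0, 2) and (1, 3) exchange the most traffic,
# so the OCS would be programmed with circuits 0<->2 and 1<->3.
traffic = np.array([[0, 1, 9, 0],
                    [1, 0, 0, 7],
                    [9, 0, 0, 1],
                    [0, 7, 1, 0]])
print(plan_circuits(traffic))    # -> [(0, 2), (1, 3)]
```

A greedy matching is used here purely for clarity; any policy that maps the current traffic demand onto the limited number of optical circuits would fill the same role in this sketch.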
Problem

Research questions and friction points this paper is trying to address.

Expert Mixture Models
Hardware Design Challenges
Large-scale Data Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

mFabric
Adaptive High-speed Networking
MoE Model Training Efficiency