RailS: Load Balancing for All-to-All Communication in Distributed Mixture-of-Experts Training

📅 2025-10-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the dominant iteration latency caused by sparse and highly imbalanced all-to-all communication in distributed Mixture-of-Experts (MoE) training, this paper proposes RailS, a novel communication framework. First, it leverages the topological symmetry of Rail interconnects to formally prove that uniform send scheduling guarantees uniform receive scheduling. Second, it introduces a decentralized Longest-Processing-Time (LPT) spraying scheduler that transforms global load balancing into coordination-free local decisions. Third, it integrates topology-aware multipath transmission to activate multiple parallel Rails for fine-grained bandwidth aggregation. Evaluated on both realistic (Mixtral) and synthetic MoE workloads, RailS improves effective bus bandwidth by 20%–78%, reduces communication completion time by 17%–78%, and decreases end-to-end iteration time by 18%–40%, approaching theoretically optimal load balance.
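The core of the decentralized scheduler is classical LPT makespan balancing applied locally: each node sorts its outgoing traffic chunks by size and greedily assigns each to the currently least-loaded rail. A minimal sketch of that idea, assuming a simple per-chunk byte-count model (the function name and data layout are illustrative, not the paper's API):

```python
import heapq

def lpt_spray(chunk_sizes, num_rails):
    """Assign traffic chunks to rails with Longest Processing Time First.

    chunk_sizes: list of chunk sizes (e.g. bytes) a node must send.
    Returns (assignment, loads): chunk index -> rail id, and per-rail totals.
    """
    # Min-heap of (current_load, rail_id) so the least-loaded rail pops first.
    rails = [(0, r) for r in range(num_rails)]
    heapq.heapify(rails)
    assignment = {}
    # LPT: place the largest remaining chunk on the least-loaded rail.
    for idx, size in sorted(enumerate(chunk_sizes), key=lambda x: -x[1]):
        load, rail = heapq.heappop(rails)
        assignment[idx] = rail
        heapq.heappush(rails, (load + size, rail))
    loads = [0] * num_rails
    for idx, rail in assignment.items():
        loads[rail] += chunk_sizes[idx]
    return assignment, loads
```

Because each node only needs its own chunk sizes and the rail count, this decision requires no cross-node coordination, which is what makes the scheduler decentralized.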

πŸ“ Abstract
Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-balancing framework that minimizes all-to-all completion time in MoE training. RailS leverages the Rail topology's symmetry to prove that uniform sending ensures uniform receiving, transforming global coordination into local scheduling. Each node independently executes a Longest Processing Time First (LPT) spraying scheduler to proactively balance traffic using local information. RailS activates N parallel rails for fine-grained, topology-aware multipath transmission. Across synthetic and real-world MoE workloads, RailS improves bus bandwidth by 20%–78% and reduces completion time by 17%–78%. For Mixtral workloads, it shortens iteration time by 18%–40% and achieves near-optimal load balance, fully exploiting architectural parallelism in distributed training.
Problem

Research questions and friction points this paper is trying to address.

Optimizing all-to-all communication load balancing in distributed MoE training
Addressing bandwidth underutilization in multi-NIC Rail architectures
Reducing iteration time through topology-aware traffic distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

RailS leverages Rail topology symmetry for uniform communication
Each node independently executes LPT spraying scheduler locally
RailS activates parallel rails for multipath transmission optimization
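The symmetry argument behind these contributions (uniform sending implies uniform receiving) can be checked on a toy model: if every node sprays each of its flows evenly across N rails, then every node's receive load is also even across rails, regardless of how skewed the send matrix is. A small illustrative sketch, assuming an idealized rail model where rail r of one node talks only to rail r of its peers (names and traffic model are assumptions, not the paper's code):

```python
def receive_loads(send_matrix, num_rails):
    """Per-node, per-rail receive totals under uniform spraying.

    send_matrix[i][j]: bytes node i sends to node j.
    Each flow is split evenly across num_rails parallel rails.
    """
    num_nodes = len(send_matrix)
    recv = [[0.0] * num_rails for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i == j:
                continue
            per_rail = send_matrix[i][j] / num_rails  # uniform spraying
            for r in range(num_rails):
                recv[j][r] += per_rail
    return recv
```

Even for a highly imbalanced send matrix, every row of the result is constant across rails, which is the local-decision property the scheduler exploits.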