RailS: Load Balancing for All-to-All Communication in Distributed Mixture-of-Experts Training

📅 2025-10-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the dominant iteration latency caused by sparse and highly imbalanced all-to-all communication in distributed Mixture-of-Experts (MoE) training, this paper proposes RailS, a novel communication framework. First, it leverages the topological symmetry of Rail interconnects to formally prove that uniform send scheduling guarantees uniform receive scheduling. Second, it introduces a decentralized Longest-Processing-Time (LPT) spraying scheduler that transforms global load balancing into coordination-free local decisions. Third, it integrates topology-aware multipath transmission to activate multiple parallel Rails for fine-grained bandwidth aggregation. Evaluated on both realistic (Mixtral) and synthetic MoE workloads, RailS improves effective bus bandwidth by 20%–78%, reduces communication completion time by 17%–78%, and decreases end-to-end iteration time by 18%–40%, approaching theoretically optimal load balance.
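The core of the decentralized scheduler is classical LPT makespan balancing applied locally: each node sorts its outgoing traffic chunks by size and greedily assigns each to the currently least-loaded rail. A minimal sketch of that idea, assuming a simple per-chunk byte-count model (the function name and data layout are illustrative, not the paper's API):

```python
import heapq

def lpt_spray(chunk_sizes, num_rails):
    """Assign traffic chunks to rails with Longest Processing Time First.

    chunk_sizes: list of chunk sizes (e.g. bytes) a node must send.
    Returns (assignment, loads): chunk index -> rail id, and per-rail totals.
    """
    # Min-heap of (current_load, rail_id) so the least-loaded rail pops first.
    rails = [(0, r) for r in range(num_rails)]
    heapq.heapify(rails)
    assignment = {}
    # LPT: place the largest remaining chunk on the least-loaded rail.
    for idx, size in sorted(enumerate(chunk_sizes), key=lambda x: -x[1]):
        load, rail = heapq.heappop(rails)
        assignment[idx] = rail
        heapq.heappush(rails, (load + size, rail))
    loads = [0] * num_rails
    for idx, rail in assignment.items():
        loads[rail] += chunk_sizes[idx]
    return assignment, loads
```

Because each node only needs its own chunk sizes and the rail count, this decision requires no cross-node coordination, which is what makes the scheduler decentralized.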

πŸ“ Abstract
Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-balancing framework that minimizes all-to-all completion time in MoE training. RailS leverages the Rail topology's symmetry to prove that uniform sending ensures uniform receiving, transforming global coordination into local scheduling. Each node independently executes a Longest Processing Time First (LPT) spraying scheduler to proactively balance traffic using local information. RailS activates N parallel rails for fine-grained, topology-aware multipath transmission. Across synthetic and real-world MoE workloads, RailS improves bus bandwidth by 20%–78% and reduces completion time by 17%–78%. For Mixtral workloads, it shortens iteration time by 18%–40% and achieves near-optimal load balance, fully exploiting architectural parallelism in distributed training.
Problem

Research questions and friction points this paper is trying to address.

Optimizing all-to-all communication load balancing in distributed MoE training
Addressing bandwidth underutilization in multi-NIC Rail architectures
Reducing iteration time through topology-aware traffic distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

RailS leverages Rail topology symmetry for uniform communication
Each node independently executes LPT spraying scheduler locally
RailS activates parallel rails for multipath transmission optimization
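The symmetry argument behind these contributions (uniform sending implies uniform receiving) can be checked on a toy model: if every node sprays each of its flows evenly across N rails, then every node's receive load is also even across rails, regardless of how skewed the send matrix is. A small illustrative sketch, assuming an idealized rail model where rail r of one node talks only to rail r of its peers (names and traffic model are assumptions, not the paper's code):

```python
def receive_loads(send_matrix, num_rails):
    """Per-node, per-rail receive totals under uniform spraying.

    send_matrix[i][j]: bytes node i sends to node j.
    Each flow is split evenly across num_rails parallel rails.
    """
    num_nodes = len(send_matrix)
    recv = [[0.0] * num_rails for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i == j:
                continue
            per_rail = send_matrix[i][j] / num_rails  # uniform spraying
            for r in range(num_rails):
                recv[j][r] += per_rail
    return recv
```

Even for a highly imbalanced send matrix, every row of the result is constant across rails, which is the local-decision property the scheduler exploits.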