Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

📅 2026-05-11

🤖 AI Summary
This work addresses the fragility of existing Expert Parallelism (EP)-based Mixture-of-Experts (MoE) inference systems under partial rank failures: because they treat membership as a fixed configuration, a single failed rank can disrupt the whole service. The authors reformulate partial-failure tolerance as a live EP validity problem and present EEP, a communication and runtime substrate that represents membership as explicit, mutable runtime state. EEP restores peer reachability without rebuilding the communication substrate, repairs lost expert coverage through a bandwidth-aware hierarchy, and reintegrates repaired ranks without forcing healthy ranks to recapture their CUDA graphs. It is implemented in an EP serving stack integrated with SGLang. Under static serving, EEP stays within 4.4% of a fixed-membership DeepEP baseline. On a single-rank failure workload, it incurs an 11-second recovery pause and an 8-second reintegration pause and restores throughput to within 95% of the pre-fault level within 52 seconds, whereas a fixed-membership full-restart baseline remains unavailable until 348 seconds.
📝 Abstract
Mixture-of-Experts (MoE) serving relies on wide expert parallelism (EP) to aggregate the memory capacity and bandwidth of many GPUs within one inference instance. This efficiency comes with a systems cost: every decoding step depends on token dispatch and combination across all active EP ranks, so even one rank failure can disrupt the entire service. Existing EP stacks handle such failures poorly because they treat membership as a fixed configuration established at initialization. The same rank set determines communicator state, expert placement, and the routing metadata baked into CUDA execution graphs, leaving the system with no way to shrink around a failure while keeping the instance valid. This paper argues that partial-failure tolerance should instead be formulated as a live EP validity problem. We present EEP, a communication and runtime substrate that represents membership as explicit, mutable runtime state. EEP repairs the specific state invalidated by a fault: it restores peer reachability without rebuilding the communication substrate, repairs lost expert coverage through a bandwidth-aware hierarchy, and reintegrates repaired ranks without forcing healthy ranks to recapture their CUDA graphs. We implement EEP in an EP serving stack integrated with SGLang and evaluate it under steady-state serving, failure recovery, and rank reintegration. The results show that explicit mutable membership preserves the steady-state fast path, staying within 4.4% of a fixed-membership DeepEP baseline under static serving, while turning a local rank fault from whole-instance downtime into two bounded interruptions. On a single-rank failure workload, EEP incurs an 11s recovery pause and an 8s reintegration pause, and restores throughput to within 95% of the pre-fault level within 52s, whereas a fixed-membership full-restart baseline remains unavailable until 348s.
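To make the idea of membership as explicit, mutable runtime state more concrete, here is a minimal Python sketch. It is not EEP's implementation or API: the `Membership` class, its fields, and its methods are hypothetical names introduced for illustration, and the repair step picks the least-loaded live rank as a simple stand-in for the bandwidth-aware hierarchy the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class Membership:
    """Hypothetical mutable EP membership (illustrative only, not EEP's API)."""
    live: set[int]                   # ranks currently participating
    placement: dict[int, set[int]]   # expert id -> ranks holding a replica
    epoch: int = 0                   # bumped on every membership change

    def fail(self, rank: int) -> list[int]:
        """Shrink around a failed rank; return experts left with no replica."""
        self.live.discard(rank)
        for ranks in self.placement.values():
            ranks.discard(rank)
        self.epoch += 1
        return sorted(e for e, ranks in self.placement.items() if not ranks)

    def load(self, rank: int) -> int:
        """Number of experts currently placed on a rank."""
        return sum(rank in ranks for ranks in self.placement.values())

    def repair(self, lost: list[int]) -> None:
        """Re-home each lost expert on the least-loaded live rank.

        Assumption: least-loaded placement replaces the paper's
        bandwidth-aware hierarchy, which is not specified here.
        """
        for expert in lost:
            target = min(sorted(self.live), key=self.load)
            self.placement[expert].add(target)

    def rejoin(self, rank: int) -> None:
        """Reintegrate a repaired rank as a live member."""
        self.live.add(rank)
        self.epoch += 1


m = Membership(
    live={0, 1, 2, 3},
    placement={0: {0}, 1: {1}, 2: {2}, 3: {3}, 4: {0, 1}},
)
lost = m.fail(1)   # expert 1 loses its only replica; expert 4 survives on rank 0
m.repair(lost)     # expert 1 is re-homed on a live rank
m.rejoin(1)        # rank 1 returns with no experts until rebalanced
```

The point of the sketch is that a fault invalidates only specific state (the live set and the coverage of a few experts), so a repair can target that state and bump an epoch rather than tearing down and reinitializing the whole instance.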
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Expert Parallelism
Partial Failure
Fault Tolerance
Distributed Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Expert Parallelism
Fault Tolerance
Mutable Membership
Distributed Inference
Xun Sun
Baidu - Intelligent Driving Group (IDG)
Computer Vision, Autonomous Driving
Shaoyuan Chen
Tsinghua University
Pingchuan Ma
Tsinghua University
Yue Chen
Approaching AI
Ziwei Yuan
Tsinghua University
Zhanhao Cao
Approaching AI
Han Han
LS2N
Musical Information Retrieval, Acoustics
Shangming Cai
Alibaba Cloud Computing
Teng Ma
Alibaba Cloud Computing
Xuchun Shang
Alibaba Cloud Computing
Xinpeng Zhao
Alibaba Cloud Computing
Ke Yang
Approaching AI
Junlin Wei
JD.com
Lianzhi Lin
JD.com
Yuji Liu
JD.com
Feng Ren
Insilico Medicine Ltd
AIDD, Drug Discovery & Development, Generative AI
Haoran Hu
Tsinghua University
Cheng Wan
Georgia Institute of Technology
Yingdi Shan
Tsinghua University
Yongwei Wu
Tsinghua University
Mingxing Zhang
Tsinghua University