Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In real-world MoE inference serving, load skew across experts leads to low GPU utilization and high synchronization overhead. This paper proposes Asynchronous Expert Parallelism (AEP), an execution paradigm that decouples layer execution from barrier-style synchronization. Its core ideas are: (1) μ-queuing, in which tokens are queued at each layer and adaptively re-batched on demand, so that GPUs process whichever layer is ready instead of waiting for straggling experts; and (2) a serving system, AMoE, that disaggregates attention from expert layers and uses a defragging scheduler to reduce batch fragmentation. On prototype MoE models, AMoE achieves up to 2.7× higher throughput than state-of-the-art baselines at a manageable latency penalty. In multi-node settings it scales nearly linearly, whereas the baseline system shows no throughput increase even when the number of GPUs is doubled.

📝 Abstract

Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as $\mu$-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth. We implement these ideas in a serving system called AMoE, which disaggregates attention from expert layers and uses a defragging scheduler to reduce batch fragmentation. Evaluations on prototype MoE models show that AMoE improves throughput by up to 2.7x compared to state-of-the-art baselines, incurring a manageable latency penalty and providing a cost-effective operating point. Furthermore, experiments demonstrate nearly linear scalability to multi-node settings, whereas the baseline system shows no throughput increase even when the number of GPUs is doubled.

Problem

Research questions and friction points this paper is trying to address.

Load skew in MoE causes poor GPU utilization

Synchronization overheads in expert-parallel systems

Small-batch executions waste memory bandwidth
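
To make the first friction point concrete, here is a rough back-of-the-envelope sketch of how load skew hurts utilization under barrier-style synchronization. It is not taken from the paper: it assumes one expert per GPU and compute time proportional to token count (a simplification that ignores the memory-bandwidth cost of small batches), and the function name and token counts are hypothetical.

```python
def barrier_utilization(tokens_per_expert):
    """Fraction of GPU time spent busy when every GPU waits for the hottest expert.

    Assumption (not from the paper): one expert per GPU, and the time to
    process a batch is proportional to the number of tokens in it.
    """
    if not tokens_per_expert or max(tokens_per_expert) == 0:
        return 0.0
    hottest = max(tokens_per_expert)
    # All GPUs are held for as long as the hottest expert takes,
    # so useful work is total tokens over (hottest * number of GPUs).
    return sum(tokens_per_expert) / (hottest * len(tokens_per_expert))


# Hypothetical token counts routed to four experts in one layer.
balanced = barrier_utilization([256, 256, 256, 256])
skewed = barrier_utilization([640, 192, 128, 64])
print(f"balanced: {balanced:.0%}, skewed: {skewed:.0%}")
```

Under these assumptions the same 1,024 tokens keep the GPUs fully busy when balanced, but only 40% busy when one expert receives 640 of them.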

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous Expert Parallelism decouples layer execution

Dynamic queuing and adaptive re-batching enhance GPU utilization

Defragging scheduler reduces batch fragmentation in AMoE system
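
The per-layer queuing and on-demand re-batching idea can be illustrated with a minimal sketch. This is not AMoE's implementation: the class name, the `max_batch` parameter, and the "serve the layer with the most waiting tokens" policy are assumptions chosen for illustration, since the abstract does not specify the scheduling policy.

```python
from collections import deque


class LayerQueues:
    """Per-layer token queues for a single GPU (hypothetical structure)."""

    def __init__(self, num_layers, max_batch):
        self.queues = [deque() for _ in range(num_layers)]
        self.max_batch = max_batch

    def enqueue(self, layer, token_id):
        """Queue a token that is ready to run at the given layer."""
        self.queues[layer].append(token_id)

    def next_batch(self):
        """Return (layer, batch) for the fullest queue, or None if all are empty.

        Assumed policy: picking the fullest queue favors larger batches;
        ties go to the lowest layer index.
        """
        layer = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        queue = self.queues[layer]
        if not queue:
            return None
        # Re-batch on demand: take whatever is waiting, up to max_batch tokens.
        batch = [queue.popleft() for _ in range(min(self.max_batch, len(queue)))]
        return layer, batch


qs = LayerQueues(num_layers=3, max_batch=4)
for token in range(5):
    qs.enqueue(1, token)   # five tokens waiting at layer 1
qs.enqueue(0, 100)         # one token waiting at layer 0

first = qs.next_batch()    # layer 1 is fullest, capped at max_batch
second = qs.next_batch()   # tie between layers 0 and 1, lowest index wins
third = qs.next_batch()    # remaining layer-1 token
```

The point of the sketch is that no step waits on a global barrier: the GPU always runs some layer that has tokens queued, and batch sizes are decided at dispatch time rather than fixed in advance.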

🔎 Similar Papers

No Need to Talk: Asynchronous Mixture of Language Models