Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

📅 2026-04-25

🤖 AI Summary

This work addresses the substantial cross-node communication overhead in multi-node Mixture-of-Experts (MoE) inference, which stems from imbalanced expert loads and inefficient token routing. It systematically characterizes three key properties of MoE expert activation: variable load imbalance, task-domain-dependent expert preferences, and a strong correlation between expert activations in the prefill and decode phases. Building on these findings, the authors propose workload-aware micro-batch grouping and an expert placement strategy that improve token-expert locality. Drawing on over 100,000 real expert activation traces collected across multiple MoE models and datasets, the approach reduces all-to-all communication volume by up to 20×, lowering decode latency and improving accelerator utilization.
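To make "load imbalance" concrete, here is a minimal sketch of one way to quantify it from an activation trace. The trace layout (one list of selected expert IDs per token) and the max-over-mean ratio are illustrative assumptions, not the metric or data format used by the authors.

```python
from collections import Counter

def expert_loads(trace, num_experts):
    """Count how many token dispatches each expert receives.

    trace: list of per-token lists of selected expert IDs (assumed layout).
    """
    counts = Counter(e for token_experts in trace for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_ratio(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 0.0

# Hypothetical top-2 routing decisions for four tokens over four experts.
trace = [[0, 1], [0, 2], [0, 1], [0, 3]]
loads = expert_loads(trace, num_experts=4)
ratio = imbalance_ratio(loads)
print(loads, ratio)
```

In this toy trace expert 0 receives every token, so its load is twice the mean; the same ratio computed per task family would expose the domain-dependent shifts the summary describes.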

📝 Abstract

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without a proportional increase in per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on various datasets and collect over 100k real expert activation traces. Studying these expert activation patterns, we uncover several properties that persist across all of the frontier MoE models: variable expert load imbalance; domain-specific expert activation, where expert popularity shifts across task families (code, math, chat, general); and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication volume by up to 20×, resulting in lower MoE decode latency and better accelerator utilization.
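The locality idea can be illustrated with a small sketch: count how many token dispatches cross a node boundary under a given expert placement, then compare a round-robin placement with a greedy one that puts each expert on the node that sends it the most tokens. The greedy heuristic, the `(source_node, expert_id)` route format, and all function names are assumptions made for illustration; the abstract does not specify the authors' placement algorithm, and this sketch does not cover micro-batch grouping.

```python
from collections import defaultdict

def remote_dispatches(routes, placement):
    """Count dispatches whose destination expert lives on another node.

    routes: list of (source_node, expert_id) pairs (assumed format).
    placement: dict mapping expert_id -> node.
    """
    return sum(1 for src, e in routes if placement[e] != src)

def greedy_placement(routes, num_nodes, experts_per_node):
    """Place each expert on the free node that sends it the most tokens.

    Assumes num_nodes * experts_per_node covers every expert seen in routes;
    experts that never appear in routes are not placed.
    """
    # demand[e][n] = number of dispatches from node n to expert e
    demand = defaultdict(lambda: [0] * num_nodes)
    for src, e in routes:
        demand[e][src] += 1
    free = [experts_per_node] * num_nodes
    placement = {}
    # Handle the most heavily used experts first so they get their best node.
    for e in sorted(demand, key=lambda e: -sum(demand[e])):
        candidates = [n for n in range(num_nodes) if free[n] > 0]
        best = max(candidates, key=lambda n: demand[e][n])
        placement[e] = best
        free[best] -= 1
    return placement

# Hypothetical routing decisions on two nodes with four experts.
routes = [(0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (1, 2), (0, 3)]
round_robin = {e: e % 2 for e in range(4)}
greedy = greedy_placement(routes, num_nodes=2, experts_per_node=2)
baseline_remote = remote_dispatches(routes, round_robin)
greedy_remote = remote_dispatches(routes, greedy)
print(baseline_remote, greedy_remote)
```

On this toy input the greedy placement halves the number of cross-node dispatches relative to round-robin. A real system would also have to balance per-node compute, which this sketch handles only through the fixed `experts_per_node` capacity.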

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

load imbalance

token routing

multi-node inference

inter-node communication

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

expert activation patterns

multi-node inference