🤖 AI Summary
This work addresses the substantial cross-node communication overhead in multi-node Mixture-of-Experts (MoE) inference, which stems from imbalanced expert loads and inefficient token routing. For the first time, it systematically characterizes three key properties of MoE expert activation: dynamic load imbalance, task-domain-dependent expert preferences, and strong correlation between the prefill and decode phases. Leveraging these insights, the authors propose a workload-aware microbatch grouping and expert placement strategy that enhances token-expert locality. Evaluated on over 100,000 real-world activation traces across multiple MoE models and datasets, the approach reduces all-to-all communication volume by up to 20×, significantly lowering decoding latency and improving accelerator utilization.
📝 Abstract
Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead.
To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on a range of datasets and collect over 100k real expert activation traces. Studying these traces, we uncover several properties that persist across all of the profiled frontier MoE models: variable expert load imbalance; domain-specific expert activation, where expert popularity shifts across task families (code, math, chat, general); and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that together maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication volume by up to 20×, resulting in lower MoE decode latency and better accelerator utilization.
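To make the locality argument concrete, the following is a minimal sketch, not the authors' implementation. It uses a toy setup of 4 experts on 2 nodes and assumes that "code" tokens route to experts 0 and 1 while "math" tokens route to experts 2 and 3. The helper names (`cross_node_dispatches`, `load_imbalance`), the trace format, and all values are hypothetical. It counts how many token-to-expert dispatches cross a node boundary when a batch mixes domains versus when tokens are grouped by domain and co-located with their preferred experts.

```python
from collections import Counter

def cross_node_dispatches(routes, placement, token_node):
    """Count dispatches whose token and destination expert sit on different nodes.

    routes:     list of (token_id, expert_id) pairs (hypothetical trace format)
    placement:  dict mapping expert_id -> node_id
    token_node: dict mapping token_id -> node_id
    """
    return sum(1 for tok, exp in routes if placement[exp] != token_node[tok])

def load_imbalance(routes, num_experts):
    """Ratio of the busiest expert's load to the mean load (1.0 = perfectly balanced)."""
    counts = Counter(exp for _, exp in routes)
    mean_load = len(routes) / num_experts
    return max(counts.values()) / mean_load

# Toy trace (assumed): tokens 0, 1, 4, 5 are "code"; tokens 2, 3, 6, 7 are "math".
routes = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0), (5, 1), (6, 2), (7, 3)]

# Assumed placement: "code" experts on node 0, "math" experts on node 1.
placement = {0: 0, 1: 0, 2: 1, 3: 1}

# Mixed micro-batches: each node holds tokens from both domains.
mixed = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

# Domain-grouped micro-batches: tokens live on the node hosting their experts.
grouped = {0: 0, 1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1, 7: 1}

mixed_cross = cross_node_dispatches(routes, placement, mixed)
grouped_cross = cross_node_dispatches(routes, placement, grouped)
imbalance = load_imbalance(routes, num_experts=4)

print(f"mixed: {mixed_cross} cross-node dispatches")
print(f"grouped: {grouped_cross} cross-node dispatches")
print(f"max/mean expert load: {imbalance:.2f}")
```

In this idealized case the grouped layout removes all cross-node dispatches. Real traces have overlapping expert preferences and top-k routing, so the reduction would be partial and would depend on the workload.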