🤖 AI Summary
To address the straggler effect in Mixture-of-Experts (MoE) models during expert-parallel inference—caused by imbalanced token allocation and leading to expert load imbalance, low resource utilization, and increased latency—this paper proposes a capacity-aware inference framework. Methodologically, it (1) explicitly models inference latency upper bounds as an optimization objective; (2) introduces capacity-aware token dropping and graph-matching-based rerouting, jointly leveraging real-time expert load monitoring and lightweight capacity prediction to dynamically balance expert loads; and (3) enables proactive load regulation without compromising accuracy. Evaluated on Mixtral-8×7B-Instruct, the framework achieves a 1.94× speedup in inference throughput, significantly reduces P99 latency, improves GPU utilization, and yields a 0.2% average performance gain.
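The straggler effect the summary describes can be illustrated with a tiny model: under expert parallelism each expert runs on its own device, so a layer's latency is bounded by the most loaded expert rather than the average load. A minimal sketch (hypothetical function name, assuming a uniform per-token cost; not the paper's actual cost model):

```python
def moe_step_time(expert_loads, per_token_ms=1.0):
    """Latency upper bound of one MoE layer under expert parallelism.

    Each expert processes its tokens in parallel on its own device,
    so the layer finishes only when the most burdened expert (the
    straggler) does. `per_token_ms` is an assumed constant cost.
    """
    return max(expert_loads) * per_token_ms

# Same total of 16 tokens, very different latency:
balanced_ms = moe_step_time([4, 4, 4, 4])    # 4.0 ms
imbalanced_ms = moe_step_time([13, 1, 1, 1])  # 13.0 ms
```

This is why the framework targets the maximum (not the mean) expert load as its optimization objective.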
📝 Abstract
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the ***Straggler Effect***. To mitigate this, we propose Capacity-Aware Inference, including two key techniques: (1) ***Capacity-Aware Token Drop***, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) ***Capacity-Aware Token Reroute***, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., a 0.2% average performance increase and a 1.94× inference speedup on Mixtral-8×7B-Instruct.
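The two techniques can be sketched as a single dispatch pass over a batch's token-to-expert assignments. This is a simplified illustration (hypothetical function, top-1 routing, a fixed per-expert capacity; the paper's actual routing and capacity policy may differ): Token Drop marks overflow tokens as skipped, while Token Reroute sends them to the least-loaded expert that still has room.

```python
def capacity_aware_dispatch(assignments, num_experts, capacity, mode="reroute"):
    """Enforce a per-expert capacity on token->expert assignments.

    assignments: one expert id per token (top-1 routing for simplicity).
    mode="drop":    overflow tokens get -1 (skip the MoE layer, e.g. via
                    the residual path), capping the straggler's load.
    mode="reroute": overflow tokens go to the least-loaded expert that
                    still has capacity, balancing the distribution.
    """
    load = [0] * num_experts
    result = []
    for expert in assignments:
        if load[expert] < capacity:
            load[expert] += 1
            result.append(expert)
        elif mode == "drop":
            result.append(-1)  # token dropped from this MoE layer
        else:
            # experts that still have room for another token
            targets = [e for e in range(num_experts) if load[e] < capacity]
            if not targets:
                result.append(-1)  # everything full: fall back to dropping
            else:
                target = min(targets, key=lambda e: load[e])
                load[target] += 1
                result.append(target)
    return result

# Expert 0 is oversubscribed; capacity 2 caps its load either way:
capacity_aware_dispatch([0, 0, 0, 1], num_experts=2, capacity=2, mode="drop")
capacity_aware_dispatch([0, 0, 0, 1], num_experts=2, capacity=2, mode="reroute")
```

In both modes no expert exceeds `capacity`, so the straggler latency bound from the abstract is enforced; reroute additionally keeps the dropped tokens in play by shifting them to underutilized experts.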