Faster MoE LLM Inference for Extremely Large Models

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inference efficiency bottleneck of fine-grained sparse Mixture-of-Experts (MoE) large language models under dynamic service workloads. We systematically investigate how the number of activated experts per token and the total number of experts jointly govern the efficiency–accuracy trade-off. We propose an end-to-end MoE inference acceleration framework integrating dynamic expert routing control, load-aware inference scheduling, and fine-grained computational graph optimization. Our key finding is that moderately reducing the number of activated experts per token improves throughput by over 10% with negligible accuracy degradation, whereas decreasing the total number of experts incurs substantial performance loss. Under zero-precision-loss constraints, our framework achieves ≥10% end-to-end inference throughput improvement. This establishes a novel paradigm for efficient deployment of ultra-large-scale MoE models.
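The core lever described above is top-k expert routing: each token is sent to only the k highest-scoring of E routed experts, so per-token expert compute scales with k while model capacity scales with E. Below is a minimal NumPy sketch of that idea; the function names, tensor sizes, and FLOP formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk_route(logits, k):
    """Select the top-k experts per token and renormalize their gate weights.
    `logits`: (tokens, num_experts) router scores. Hypothetical helper."""
    idx = np.argsort(logits, axis=-1)[:, -k:]               # top-k expert ids per token
    gates = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)              # softmax over the selected experts
    return idx, gates

def expert_flops(tokens, k, d_model, d_ff):
    # Rough per-batch expert FLOPs: each activated expert applies two
    # dense projections (d_model -> d_ff -> d_model); linear in k.
    return tokens * k * 2 * (2 * d_model * d_ff)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 64))    # 4 tokens, 64 routed experts (made-up sizes)
for k in (8, 6):                     # e.g. top-8 vs. a moderately reduced top-6
    idx, gates = topk_route(logits, k)
    print(k, expert_flops(1024, k, 4096, 1024))
```

The sketch shows why reducing the activated count k cuts expert compute directly, while reducing the total expert count E only shrinks the routing pool (and, per the paper's finding, costs accuracy).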

📝 Abstract
Sparse Mixture-of-Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of the DeepSeek models, fine-grained MoE models are gaining popularity, yet research on them remains limited. We therefore examine their efficiency dynamics under different service loads. Fine-grained models also allow deployers to reduce the number of routed experts, both the activated count per token and the total count, raising the question of how such reductions affect the trade-off between MoE efficiency and model performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can yield substantial efficiency gains in certain scenarios with only minor performance degradation, whereas reducing the total number of experts provides limited efficiency gains but causes severe performance degradation. Our method increases throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.
Problem

Research questions and friction points this paper is trying to address.

Optimizing fine-grained MoE LLM inference efficiency under varying service loads
Exploring trade-offs between expert reduction and MoE performance degradation
Improving throughput without sacrificing model performance in MoE systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic analysis of how fine-grained MoE expert routing governs inference efficiency
Moderately reducing activated experts per token yields significant efficiency gains at minor accuracy cost
End-to-end throughput improves by at least 10% with no performance loss
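As a back-of-envelope check on the throughput claim: if per-token expert compute scales roughly linearly with the activated expert count k (an assumption for illustration; the paper's exact configuration is not given here), then trimming one expert from a top-8 setup already saves about 12.5% of expert compute, consistent in magnitude with a ≥10% throughput gain.

```python
def expert_compute_saving(k_full, k_reduced):
    # Fraction of expert compute saved, assuming cost is linear in the
    # activated expert count k (illustrative model, not the paper's numbers).
    return 1 - k_reduced / k_full

print(expert_compute_saving(8, 7))  # top-8 -> top-7: 12.5% less expert compute
```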