Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inference latency of Mixture-of-Experts (MoE) models caused by activating a large number of experts, a critical bottleneck in resource-constrained settings; existing methods that sparsify activation often degrade model performance. The authors propose Alloc-MoE, a framework that jointly optimizes layer-level and token-level expert activation under a fixed activation budget. It determines optimal per-layer activation quotas via sensitivity analysis and dynamic programming (Alloc-L) and dynamically reallocates token activations based on routing scores (Alloc-T) to minimize performance loss. Alloc-MoE introduces, for the first time, an explicit activation-budget constraint and establishes a unified two-level dynamic scheduling mechanism that substantially improves inference efficiency without compromising accuracy. Experiments on DeepSeek-V2-Lite show that with only half the original activation budget, Alloc-MoE achieves 1.15× faster prefill and 1.34× faster decode speeds while preserving model performance.
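The per-layer quota search described above (sensitivity profiling plus dynamic programming in Alloc-L) can be sketched as a budgeted-allocation DP: given a profiled loss proxy for activating k experts in each layer, pick per-layer quotas that minimize total loss under the global budget. The function name `alloc_l`, the loss-table shape, and the backtracking scheme below are illustrative assumptions, not the paper's implementation:

```python
def alloc_l(loss, budget, k_max):
    """Sketch of a budgeted per-layer quota DP.

    loss[l][k-1]: profiled loss proxy when layer l activates k experts.
    Assumes num_layers <= budget <= num_layers * k_max.
    """
    num_layers = len(loss)
    INF = float("inf")
    # dp[l][b] = min total loss over layers 0..l-1 spending exactly b activations
    dp = [[INF] * (budget + 1) for _ in range(num_layers + 1)]
    dp[0][0] = 0.0
    choice = [[0] * (budget + 1) for _ in range(num_layers + 1)]
    for l in range(1, num_layers + 1):
        for b in range(l, budget + 1):  # each layer activates at least one expert
            for k in range(1, min(k_max, b) + 1):
                cand = dp[l - 1][b - k] + loss[l - 1][k - 1]
                if cand < dp[l][b]:
                    dp[l][b] = cand
                    choice[l][b] = k
    # backtrack the chosen quota for each layer
    quotas, b = [], budget
    for l in range(num_layers, 0, -1):
        k = choice[l][b]
        quotas.append(k)
        b -= k
    return quotas[::-1]
```

The DP runs offline on profiled sensitivities, so its O(layers × budget × k_max) cost does not affect inference latency.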
📝 Abstract
Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that coordinates budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Notably, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.
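The token-level redistribution the abstract describes (Alloc-T spending a fixed batch budget according to routing scores) resembles a global budgeted top-up: every token keeps at least one expert, and the remaining budget goes to whichever token's next-best routing score is highest. The greedy sketch below is one plausible reading under stated assumptions; `alloc_t` and its parameters are hypothetical and may differ from the paper's mechanism:

```python
import numpy as np

def alloc_t(scores, budget, k_min=1, k_max=None):
    """Greedy sketch of routing-score-driven token budget reallocation.

    scores: (T, E) routing scores for one layer's batch of T tokens.
    budget: total expert activations allowed for the batch (>= k_min * T).
    """
    T, E = scores.shape
    k_max = E if k_max is None else k_max
    counts = np.full(T, k_min)                   # every token gets k_min experts
    order = np.argsort(scores, axis=1)[:, ::-1]  # experts sorted by score per token
    remaining = budget - k_min * T
    while remaining > 0:
        # marginal[t]: score of token t's next unchosen expert (if any left)
        marginal = np.array([
            scores[t, order[t, counts[t]]] if counts[t] < k_max else -np.inf
            for t in range(T)
        ])
        counts[int(np.argmax(marginal))] += 1
        remaining -= 1
    chosen = [order[t, :counts[t]].tolist() for t in range(T)]
    return counts, chosen
```

Tokens with flat routing distributions naturally absorb more of the budget than tokens whose mass concentrates on one expert, which matches the intuition of redistributing activations by routing confidence.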
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert activation
inference latency
activation budget
resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
activation budget
expert allocation
efficient inference
dynamic programming
Baihui Liu
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology
Kaiyuan Tian
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology
Wei Wang
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology
Zhaoning Zhang
National University of Defense Technology
MLSys, Computer Vision, Distributed Computing
Linbo Qiao
NUDT
Stochastic Optimization, Distributed Optimization, Large-scale Machine Learning
Dongsheng Li
Professor, School of Computer Science, National University of Defense Technology
Distributed Computing, Parallel Computing, Cloud Computing, Peer-to-Peer Computing, Big Data