🤖 AI Summary
This work addresses the significant inference latency of Mixture-of-Experts (MoE) models caused by activating a large number of experts, which becomes a critical bottleneck in resource-constrained settings; existing sparse-activation methods reduce this cost but often degrade model performance. The authors propose Alloc-MoE, a framework that jointly optimizes inter-layer and token-level expert activation under a fixed activation budget. It determines optimal per-layer activation quotas via sensitivity analysis and dynamic programming (Alloc-L) and dynamically reallocates token activations based on routing scores (Alloc-T) to minimize performance loss. Alloc-MoE introduces, for the first time, an explicit activation budget constraint and establishes a unified two-level dynamic scheduling mechanism that substantially improves inference efficiency without compromising model accuracy. Experiments on DeepSeek-V2-Lite show that with only half the original activation budget, Alloc-MoE achieves 1.15× faster prefill and 1.34× faster decode while preserving model performance.
📝 Abstract
Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that jointly optimizes budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Notably, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.
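To make the two-level idea concrete, here is a minimal sketch of how a budgeted allocator of this kind could work. It is not the paper's implementation: the function names `alloc_l` and `alloc_t`, the per-layer sensitivity table `layer_loss`, and the specific redistribution rule (guarantee each token a minimum of `k_min` experts, then spend the leftover budget on the globally highest routing scores) are all illustrative assumptions consistent with the abstract's description of dynamic programming over layers and routing-score-based token reallocation.

```python
import numpy as np

def alloc_l(layer_loss, budget):
    """Hypothetical Alloc-L sketch: pick per-layer expert quotas by
    dynamic programming. layer_loss[l][k-1] is an assumed sensitivity
    estimate of the loss incurred when layer l activates k experts
    (k = 1..K); the paper's actual profiling metric may differ."""
    L, K = layer_loss.shape  # K columns correspond to quotas 1..K
    INF = float("inf")
    # dp[l][s] = min total loss over the first l layers using s activations
    dp = [[INF] * (budget + 1) for _ in range(L + 1)]
    dp[0][0] = 0.0
    pick = [[0] * (budget + 1) for _ in range(L + 1)]
    for l in range(L):
        for spent in range(budget + 1):
            if dp[l][spent] == INF:
                continue
            for k in range(1, K + 1):  # each layer keeps at least 1 expert
                if spent + k > budget:
                    break
                cand = dp[l][spent] + layer_loss[l][k - 1]
                if cand < dp[l + 1][spent + k]:
                    dp[l + 1][spent + k] = cand
                    pick[l + 1][spent + k] = k
    # Backtrack the optimal quotas from the fully spent budget.
    quotas, spent = [], budget
    for l in range(L, 0, -1):
        k = pick[l][spent]
        quotas.append(k)
        spent -= k
    return quotas[::-1]

def alloc_t(scores, avg_k, k_min=1):
    """Hypothetical Alloc-T sketch: every token is guaranteed k_min
    experts; the remaining batch budget (avg_k experts per token on
    average) goes to the globally highest remaining routing scores."""
    T, E = scores.shape
    masked = scores.astype(float).copy()
    base = np.argsort(-scores, axis=1)[:, :k_min]  # per-token top-k_min
    for t in range(T):
        masked[t, base[t]] = -np.inf               # already granted
    extra = T * avg_k - T * k_min                  # leftover activations
    flat = np.argsort(-masked, axis=None)[:extra]  # best remaining pairs
    keep = np.zeros((T, E), dtype=bool)
    np.put(keep, flat, True)
    for t in range(T):
        keep[t, base[t]] = True
    return keep  # boolean mask of activated (token, expert) pairs
```

In this toy form, `alloc_l` solves a knapsack-style problem (layers are items, quotas are item sizes, profiled loss is the cost), while `alloc_t` lets a token whose routing distribution is peaked release activations to tokens whose scores are more spread out, keeping the batch-average activation count at the budget.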