LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models fix the number of activated experts per layer, leading to computational redundancy and suboptimal inference efficiency; while post-training pruning reduces the memory footprint, it rarely improves actual throughput. This paper proposes a **data-agnostic, layer-adaptive expert activation mechanism**: using only weight-based analysis, it assesses each layer's importance and determines the optimal number of active experts for that layer, without fine-tuning or calibration data. The method is fully compatible with high-performance inference frameworks (e.g., vLLM), enabling fine-grained, low-overhead expert scheduling. Evaluated on mainstream language and vision MoE models, including Qwen1.5-MoE, it achieves up to 10% better accuracy than traditional expert pruning at equal throughput on NVIDIA H100 GPUs, with significantly accelerated inference and negligible accuracy loss relative to the unmodified model. To our knowledge, this is the first approach to achieve purely weight-driven, cross-layer heterogeneous expert activation optimization.

📝 Abstract
Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage, they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPU using optimized frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts per layer accordingly. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on an NVIDIA H100 GPU with 10% better accuracy than traditional expert pruning.
Problem

Research questions and friction points this paper is trying to address.

Optimizing MoE model inference efficiency on GPU
Reducing redundant computation in fixed-expert activation layers
Determining layer-adaptive expert counts without training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-adaptive expert activation optimization
Data-free technique using model weights
Dynamic expert allocation per layer
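The idea behind these bullets, scoring each layer from its weights alone and mapping the score to a per-layer active-expert count, can be sketched as follows. The paper's exact importance metric is not stated in this summary, so the sketch assumes a hypothetical proxy (the variance of per-expert router gate-weight row norms) and a simple min-max linear mapping to counts between `k_min` and `k_max`; both choices are illustrative, not the authors' method.

```python
import numpy as np

def layer_importance(gate_weights):
    """Hypothetical data-free importance proxy for one MoE layer.

    gate_weights: (num_experts, hidden_dim) router weight matrix.
    Assumption: layers whose router weights differentiate experts more
    strongly (higher spread of per-expert row norms) matter more.
    """
    row_norms = np.linalg.norm(gate_weights, axis=1)  # one norm per expert
    return float(np.var(row_norms))

def assign_active_experts(gate_weights_per_layer, k_min=1, k_max=4):
    """Map per-layer importance scores to heterogeneous expert counts.

    Scores are min-max normalized, then linearly interpolated between
    k_min and k_max, so more important layers activate more experts.
    """
    scores = np.array([layer_importance(w) for w in gate_weights_per_layer])
    lo, hi = scores.min(), scores.max()
    normed = (scores - lo) / (hi - lo) if hi > lo else np.ones_like(scores)
    return [int(round(k_min + s * (k_max - k_min))) for s in normed]

# Toy example: 4 MoE layers, 8 experts each, hidden size 16, with
# increasing router-weight scale so later layers score higher.
rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.02 * (i + 1), size=(8, 16)) for i in range(4)]
counts = assign_active_experts(layers)
print(counts)  # one active-expert count per layer, each between 1 and 4
```

Because the assignment runs once over pretrained weights, it needs no calibration data and the resulting per-layer top-k values can be handed directly to an inference framework's router configuration.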