PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory overhead of Mixture-of-Experts (MoE) large language models on resource-constrained devices, this paper proposes a task-aware expert pruning and retrieval framework. The authors first empirically show that expert activation in MoE models exhibits strong task specificity. Leveraging this insight, they introduce the Task-Conditioned Expected Selection Score (TCESS) as a novel importance metric, enabling probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER), a new paradigm supporting on-demand loading of critical experts. The approach combines lightweight expert-pattern pre-caching, efficient matching-based retrieval, and dynamic sub-model reconstruction for inference acceleration. Experiments demonstrate that DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 under 8/128 expert pruning, while Pangu-Ultra-MoE 718B achieves 96.95% accuracy under 4/64 pruning while requiring only 390 GB of GPU memory, significantly enhancing the deployability of MoE models on constrained hardware.

📝 Abstract
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
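The abstract describes TCESS as an expert-importance metric derived from router logits on task-specific data, which PEP then uses to keep only a minimal set of experts. The paper does not give the exact formula here, so the following is a minimal sketch of one plausible reading: score each expert by the expected routing-probability mass it receives on a task's calibration tokens, counting only tokens where it lands in the router's top-k selection. The function names and the exact masking choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def tcess(router_logits: np.ndarray, top_k: int = 8) -> np.ndarray:
    """Illustrative Task-Conditioned Expected Selection Score.

    router_logits: [num_tokens, num_experts] raw router outputs collected
    on a task-specific calibration set.
    Returns one importance score per expert: the routing probability mass
    the expert receives, averaged over tokens, counting only tokens where
    the expert is among the top-k selected.
    """
    # Numerically stable softmax over the expert dimension.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Keep only the top-k experts actually selected for each token.
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, topk_idx, 1.0, axis=-1)
    # Expectation over the calibration tokens.
    return (probs * mask).mean(axis=0)

def prune_experts(scores: np.ndarray, keep: int) -> np.ndarray:
    """PEP-style selection: indices of the `keep` highest-scoring experts."""
    return np.sort(np.argsort(-scores)[:keep])
```

Under this sketch, an "8/128 configuration" corresponds to keeping 128 experts per layer (from an original 256) with 8 routed per token.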
Problem

Research questions and friction points this paper is trying to address.

Reducing memory demands of large MoE models for deployment
Identifying task-specific critical experts to minimize resource usage
Enabling efficient inference with adaptive expert retrieval and pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic expert pruning reduces memory usage
Task-adaptive expert retrieval enhances efficiency
Compact expert patterns enable minimal subset loading
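TAER, as the abstract describes it, pre-computes a compact expert pattern per task, matches an incoming query against the stored patterns, and reconstructs a sub-model by loading only that task's critical experts. The paper does not specify the matching function; the sketch below assumes cosine similarity between TCESS-style importance vectors, and the pattern-cache layout and function names are hypothetical.

```python
import numpy as np

def match_task(query_pattern: np.ndarray,
               cached_patterns: dict) -> str:
    """Return the cached task whose expert-importance pattern is most
    similar to the query's pattern (cosine similarity is an assumption)."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(cached_patterns, key=lambda t: cos(query_pattern, cached_patterns[t]))

def reconstruct(shared_weights: dict, expert_bank: dict, expert_ids) -> dict:
    """Assemble a sub-model: shared (always-loaded) weights plus only the
    retrieved subset of experts, keeping the memory footprint small."""
    return {**shared_weights, "experts": {i: expert_bank[i] for i in expert_ids}}
```

In a deployment like the one described, `expert_bank` would live on disk or host memory, and only the experts named by the matched pattern would be moved onto the accelerator.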