Does a Global Perspective Help Prune Sparse MoEs Elegantly?

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models improve computational efficiency but still incur substantial memory overhead from their large total count of expert parameters, and existing pruning methods often overlook inter-layer differences in redundancy. This work proposes GRAPE, a global redundancy-aware expert pruning strategy that is the first to introduce cross-layer redundancy analysis, enabling dynamic allocation of layer-wise pruning budgets and overcoming the limitations of conventional uniform or local pruning approaches. GRAPE applies to mainstream architectures such as Mixtral, DeepSeek-MoE, and Qwen-MoE. Under identical pruning budgets, it improves average accuracy over the strongest local baseline by 1.40%, with gains of up to 2.45%, significantly enhancing both pruning efficacy and model performance.
📝 Abstract
Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
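The core idea of globally allocating pruning budgets in proportion to per-layer redundancy can be sketched as follows. This is a minimal illustration, not the paper's implementation: the redundancy score here (mean pairwise cosine similarity between flattened expert weights) and the proportional allocation rule are assumptions chosen for clarity, and the actual GRAPE criterion may differ.

```python
import numpy as np

def layer_redundancy(expert_weights):
    """Mean off-diagonal pairwise cosine similarity among one MoE
    layer's experts. `expert_weights` has shape (num_experts, dim),
    one flattened expert per row. This is an illustrative proxy for
    a layer's redundancy, not the paper's exact measure."""
    normed = expert_weights / np.linalg.norm(expert_weights, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = expert_weights.shape[0]
    return (sim.sum() - n) / (n * (n - 1))

def allocate_budgets(redundancy, total_to_prune, max_per_layer):
    """Split a single global pruning budget across layers in
    proportion to their redundancy scores -- the 'global' view,
    as opposed to a uniform per-layer budget."""
    weights = np.asarray(redundancy, dtype=float)
    weights = weights / weights.sum()
    budgets = np.minimum(np.floor(weights * total_to_prune).astype(int),
                         max_per_layer)
    # Hand out any rounding remainder to the most redundant layers
    # that still have room below the per-layer cap.
    remainder = total_to_prune - budgets.sum()
    for i in np.argsort(-weights):
        if remainder == 0:
            break
        take = min(max_per_layer - budgets[i], remainder)
        budgets[i] += take
        remainder -= take
    return budgets
```

With two toy layers, one holding near-duplicate experts and one holding near-orthogonal experts, the allocation assigns more of the shared budget to the redundant layer, which is the behavior a uniform per-layer scheme cannot express.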
Problem

Research questions and friction points this paper is trying to address.

Sparse Mixture-of-Experts, model pruning, memory consumption, heterogeneous redundancy, pruning budget allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts, Global Pruning, Redundancy-Aware, Dynamic Budget Allocation, Model Compression