DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MoE pruning methods apply a uniform sparsity across all layers, ignoring inter-layer variation in expert redundancy and thereby degrading model performance. To address this, the paper proposes a layer-adaptive, differentiable expert pruning framework: it reformulates the discrete pruning search as end-to-end continuous optimization via a Gumbel-Softmax relaxation, and couples a differentiable gating mechanism with a hierarchical importance-learning module to jointly learn inter-layer importance and dynamically allocate layer-wise pruning ratios. The work is the first to achieve non-uniform, differentiable MoE architecture compression. On Mixtral 8×7B, the method retains 92% of the original performance while pruning 50% of the experts; on benchmarks including MMLU, it outperforms state-of-the-art pruning approaches by up to 7.1%, substantially reducing the memory and storage overhead of large MoE models.

📝 Abstract
Despite the significant breakthrough of Mixture-of-Experts (MoE), the increasing scale of these MoE models presents huge memory and storage challenges. Existing MoE pruning methods, which reduce parameter size with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation due to varying expert redundancy in different MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed **Di**fferentiable **E**xpert **P**runing (**DiEP**), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles exponentially growing non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, **DiEP** retains around 92% of the original performance on Mixtral 8×7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.
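The abstract's central move — relaxing the discrete keep/drop decision for each expert into a continuous, gradient-friendly one via Gumbel-Softmax — can be sketched roughly as follows. This is an illustrative NumPy sketch under our own assumptions (a per-expert two-way keep/drop relaxation; names such as `keep_logits` and `tau` are ours), not the paper's implementation.

```python
import numpy as np

def gumbel_softmax_keep_gates(keep_logits, tau=1.0, rng=None):
    """Per-expert two-way (keep vs. drop) Gumbel-Softmax relaxation.

    Returns soft keep-gates in (0, 1) that admit gradients w.r.t.
    `keep_logits`; as tau -> 0 the gates approach hard 0/1 choices.
    Illustrative only -- the formulation is an assumption, not DiEP's.
    """
    rng = np.random.default_rng(rng)
    # Two logits per expert: [keep, drop]; the drop logit is fixed at 0 here.
    logits = np.stack([keep_logits, np.zeros_like(keep_logits)], axis=-1)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    y = (logits - np.log(-np.log(u))) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))   # numerically stable softmax
    y = y / y.sum(axis=-1, keepdims=True)
    return y[..., 0]                                # soft "keep" gate per expert

# Toy example: 8 experts in one MoE layer.
keep_logits = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5, -0.5])
gates = gumbel_softmax_keep_gates(keep_logits, tau=0.5, rng=0)
print(gates.round(3))
```

Because the gates stay continuous during training, a sparsity penalty on them can be optimized with ordinary backpropagation, and lowering `tau` over training anneals them toward a hard pruning mask.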
Problem

Research questions and friction points this paper is trying to address.

Adaptively pruning Mixture-of-Experts models to reduce memory
Addressing varying expert redundancy across different MoE layers
Enabling gradient-based pruning while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable expert pruning for adaptive compression
Layer-level pruning rate adjustment via gradient optimization
Continuous transformation of discrete expert search space
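The layer-level rate adjustment above can be illustrated with a simple global-threshold rule: once per-expert importance scores are learned, keeping the globally top-scoring fraction automatically yields different pruning ratios per layer. A hypothetical sketch (the scores and the global top-k rule are our assumptions, not the paper's exact allocation mechanism):

```python
import numpy as np

def layerwise_keep_counts(gate_scores, global_keep_frac=0.5):
    """Allocate a global expert budget non-uniformly across MoE layers.

    gate_scores: list of per-layer arrays of learned importance scores.
    Experts whose score clears a single global threshold are kept, so
    layers with more redundant experts end up pruned more heavily.
    """
    all_scores = np.concatenate(gate_scores)
    k = max(1, int(round(global_keep_frac * all_scores.size)))
    threshold = np.sort(all_scores)[::-1][k - 1]   # k-th largest score overall
    return [int((scores >= threshold).sum()) for scores in gate_scores]

# Two toy MoE layers with 4 experts each; keep half of all experts globally.
scores = [np.array([0.9, 0.8, 0.7, 0.2]),    # layer with mostly useful experts
          np.array([0.5, 0.1, 0.05, 0.02])]  # layer with mostly redundant experts
print(layerwise_keep_counts(scores, global_keep_frac=0.5))  # → [3, 1]
```

Both layers share the same 50% global budget, yet the redundant layer loses three of its four experts while the useful layer loses only one — the non-uniform outcome the paper argues uniform-sparsity methods cannot reach.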