REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models offer computational efficiency, but expert compression—especially expert merging—can cause functional subspace collapse, in which the router loses its input-dependent control over expert selection, introducing an irreducible error. This work identifies and formalizes that deficiency. The authors propose Router-weighted Expert Activation Pruning (REAP), a one-shot, fine-tuning-free compression method that scores expert importance using both router gate-values and expert activation norms. They argue theoretically that pruning preserves generative capability more faithfully than merging. Empirical evaluation on models ranging from 20B to 1T parameters shows that pruning 50% of experts is near-lossless: Qwen3-Coder-480B and Kimi-K2 retain competitive code-generation and tool-use accuracy, significantly outperforming both expert merging and alternative pruning baselines.

📝 Abstract
Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a "functional subspace collapse", due to the loss of the router's independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
Problem

Research questions and friction points this paper is trying to address.

SMoE models' large parameter counts create significant memory overhead, motivating expert compression
Expert merging causes functional subspace collapse, degrading generative performance
Whether one-shot expert pruning can outperform merging on large-scale generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

REAP is a one-shot, fine-tuning-free router-weighted expert activation pruning criterion
It scores each expert using both router gate-values and expert activation norms
Achieves near-lossless compression even after pruning 50% of experts
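The Innovation bullets above describe the REAP criterion only at a high level. A minimal NumPy sketch of how such a router-weighted activation score could rank experts for one-shot pruning — the `gates` and `act_norms` arrays, the per-token product, and the simple averaging are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration statistics: for T tokens and E experts,
# gates[t, j] is the router's gate-value for expert j on token t, and
# act_norms[t, j] is the L2 norm of expert j's output on token t
# (zero whenever the expert is not routed). Stand-in random data here.
T, E = 1024, 8
gates = rng.random((T, E))
act_norms = rng.random((T, E))

def reap_saliency(gates: np.ndarray, act_norms: np.ndarray) -> np.ndarray:
    """Score each expert by the mean of gate-value x activation norm,
    a sketch of a router-weighted expert activation criterion."""
    return (gates * act_norms).mean(axis=0)

def prune_experts(saliency: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the top keep_ratio fraction of experts by saliency score."""
    k = max(1, int(len(saliency) * keep_ratio))
    top = np.argsort(saliency)[::-1][:k]  # highest-scoring experts first
    return np.sort(top)                   # sorted indices of kept experts

scores = reap_saliency(gates, act_norms)
kept = prune_experts(scores, keep_ratio=0.5)
print(kept)  # indices of the retained experts
```

At 50% compression this retains the 4 highest-scoring of the 8 experts; the remaining experts (and their rows in the router's projection) would then be deleted from the checkpoint, with no fine-tuning step afterward.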