Pruning and Distilling Mixture-of-Experts into Dense Language Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the deployment of Mixture-of-Experts (MoE) models in memory-constrained environments. Every expert's parameters must be resident in memory, and existing compression techniques do not remove that bottleneck because their output is still an MoE model. The authors propose what they describe as the first systematic framework for converting a trained MoE model into a standard dense one. Experts are scored (including with a new diversity-aware scoring method), selected, grouped, and concatenated into a dense feed-forward network (FFN), which is then refined with magnitude scaling and knowledge distillation from the MoE teacher. In a controlled comparison at matched parameter count, the resulting dense models outperform dense-to-dense pruning by 6.3 percentage points in average downstream accuracy after roughly 4B tokens of distillation, with 1.6× faster training wall-clock speed.
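The concatenation step can be illustrated with a small sketch. It assumes SwiGLU-style experts (gate, up, and down projections), which the summary does not state, and it uses random weights and hypothetical helper names (`swiglu_ffn`, `concat_experts`, `random_expert`). Scoring, selection, grouping, and magnitude scaling are left out. The sketch only shows that stacking the selected experts' gate and up rows and concatenating their down-projection columns gives a single dense FFN whose output equals the unweighted sum of the individual expert outputs.

```python
import math
import random

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(gate, up, down, x):
    # Assumed expert form: down @ (silu(gate @ x) * (up @ x)).
    h = [silu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return matvec(down, h)

def concat_experts(experts):
    # Stack gate/up rows across experts (wider hidden layer) and
    # concatenate the columns of each down projection to match.
    gate = [row for e in experts for row in e["gate"]]
    up = [row for e in experts for row in e["up"]]
    d_model = len(experts[0]["down"])
    down = [[w for e in experts for w in e["down"][i]] for i in range(d_model)]
    return gate, up, down

def random_expert(rng, d_model, hidden):
    # Hypothetical stand-in for a trained expert's weights.
    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return {"gate": rand(hidden, d_model),
            "up": rand(hidden, d_model),
            "down": rand(d_model, hidden)}

rng = random.Random(0)
d_model, hidden = 4, 3
experts = [random_expert(rng, d_model, hidden) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

gate, up, down = concat_experts(experts)
dense_out = swiglu_ffn(gate, up, down, x)

# Reference: run each expert separately and add the outputs.
per_expert = [swiglu_ffn(e["gate"], e["up"], e["down"], x) for e in experts]
sum_out = [sum(vals) for vals in zip(*per_expert)]
```

An MoE layer weights its active experts by router scores, whereas this concatenated layer sums every selected expert with weight one. Its output magnitude will therefore generally differ from the original layer's, which is a plausible reason for the rescaling and distillation steps that follow.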

📝 Abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less suitable for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation, at 1.6× faster training wall-clock speed.
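The abstract does not state which distillation objective is used. As an illustration only, the sketch below computes a common choice: the forward KL divergence between teacher and student next-token distributions at a single position. The function names, the `temperature` parameter, and the example logits are all assumptions rather than details from the paper.

```python
import math

def log_softmax(logits, temperature=1.0):
    # Numerically stable log-softmax over one token position.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    # Forward KL(teacher || student): penalizes the student for putting
    # low probability on tokens the teacher considers likely.
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

teacher = [2.0, 0.5, -1.0]        # hypothetical MoE teacher logits
student = [0.0, 0.0, 0.0]         # hypothetical dense student logits

loss_mismatch = kd_loss(teacher, student)
loss_match = kd_loss(teacher, teacher)
```

In training, a loss like this would be averaged over all token positions in a batch and minimized with respect to the dense student's parameters while the MoE teacher stays frozen.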

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

dense language models

model compression

memory-constrained deployment

knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

model compression

knowledge distillation