SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic study of how to compress Mixture-of-Experts (MoE) models at pretraining scale. We examine how structured pruning and knowledge distillation (KD) can be combined in MoE pretraining and report four main results. First, initializing from a pruned pretrained MoE outperforms training the target architecture from scratch under the same training budget. Second, a partial-preservation expert merging strategy improves downstream performance on most benchmarks. Third, multi-token prediction (MTP) distillation yields consistent gains. Fourth, given the same training tokens, progressive pruning schedules outperform one-shot compression. Combining these techniques, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance.
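To make the contrast between one-shot and progressive pruning concrete, here is a minimal sketch of a stage plan. It is not the paper's schedule, which this summary does not specify. The function name `progressive_plan`, the linear interpolation between sizes, the even token split, and the expert counts in the usage lines are all assumptions made for illustration.

```python
def progressive_plan(start_experts, target_experts, num_stages, total_tokens):
    """Return a list of (experts_after_stage, tokens_for_stage) pairs.

    Assumptions (hypothetical, not taken from the paper):
    - the expert count shrinks linearly from start to target,
    - the token budget is split evenly, with any remainder on the last stage.
    With num_stages == 1 this degenerates to one-shot compression.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if target_experts > start_experts:
        raise ValueError("target_experts must not exceed start_experts")

    removed = start_experts - target_experts
    base_tokens = total_tokens // num_stages
    plan = []
    for stage in range(1, num_stages + 1):
        # Experts remaining once this stage's pruning step has been applied.
        experts = start_experts - round(removed * stage / num_stages)
        tokens = base_tokens
        if stage == num_stages:
            # Put the rounding remainder on the final stage so budgets match.
            tokens += total_tokens - base_tokens * num_stages
        plan.append((experts, tokens))
    return plan


# Same total budget, two schedules (expert counts are made up).
one_shot = progressive_plan(64, 16, num_stages=1, total_tokens=100)
gradual = progressive_plan(64, 16, num_stages=3, total_tokens=100)
```

Both plans end at the same architecture and consume the same number of tokens, so any difference in final quality comes from the path taken, which is the comparison the summary describes.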

📝 Abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
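The third finding, that KD combined with the language modeling loss outperforms KD alone, can be illustrated with a small per-token sketch. It uses the standard forward-KL distillation term and a cross-entropy term; the abstract does not give the paper's exact formulation, so the mixing weight `kd_weight`, the `temperature` parameter, and all function names here are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def lm_loss(student_logits, target_index):
    """Cross-entropy of the student against the ground-truth next token."""
    return -log_softmax(student_logits)[target_index]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary for one position."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def combined_loss(student_logits, teacher_logits, target_index,
                  kd_weight=0.5, temperature=1.0):
    """Weighted sum of the LM and KD terms (the weighting is an assumption)."""
    return ((1.0 - kd_weight) * lm_loss(student_logits, target_index)
            + kd_weight * kd_loss(student_logits, teacher_logits, temperature))


# Toy three-token vocabulary with made-up logits.
student = [2.0, 0.5, -1.0]
teacher = [2.5, 0.0, -1.5]
loss = combined_loss(student, teacher, target_index=0, kd_weight=0.5)
```

Setting `kd_weight=1.0` recovers the KD-only objective that the abstract compares against, while `kd_weight=0.0` recovers plain language modeling.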

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

Model Compression

Structured Pruning

Knowledge Distillation

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts (MoE)

Structured Pruning

Knowledge Distillation