🤖 AI Summary
To address the high memory overhead that a growing number of experts imposes on Mixture-of-Experts (MoE) large language models, this work identifies, for the first time, knowledge redundancy among experts that arises during pretraining. We propose a task-agnostic expert grouping and pruning framework: experts are clustered into groups based on representational similarity; a knowledge-diversity-driven pruning criterion retains complementary experts within each group; and knowledge distillation is integrated to strengthen post-pruning fine-tuning. The framework is architecture-agnostic, supporting diverse MoE models including Mixtral, DeepSeek-MoE, and Qwen. Experimental results demonstrate that our method achieves up to 30% memory compression while preserving natural language understanding and generation performance, substantially outperforming existing pruning approaches. This work establishes a general, scalable paradigm for efficient deployment of sparse MoE models.
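The grouping-then-pruning idea can be illustrated with a minimal sketch. The function name, the greedy cosine-similarity grouping, and the simplified diversity rule (keep, per group, the expert least similar on average to experts outside the group) are all illustrative assumptions; the paper's actual similarity measure and pruning criterion are not reproduced here.

```python
import numpy as np

def group_and_prune_experts(expert_weights, sim_threshold=0.9):
    """Hypothetical sketch: group experts by cosine similarity of their
    flattened weights, then retain one representative per group.

    expert_weights: list of np.ndarray, one per expert (any shape).
    Returns sorted indices of retained experts.
    """
    # Normalize each expert's flattened weights for cosine similarity.
    vecs = [w.ravel() / (np.linalg.norm(w.ravel()) + 1e-12)
            for w in expert_weights]
    groups = []  # each group is a list of expert indices
    for i, v in enumerate(vecs):
        for g in groups:
            # Compare against the group's first member as its representative.
            if float(vecs[g[0]] @ v) >= sim_threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    # Simplified diversity criterion (an assumption, not the paper's rule):
    # within each group, keep the expert least similar on average to
    # experts outside the group, i.e. the most "complementary" one.
    retained = []
    for g in groups:
        others = [vecs[j] for j in range(len(vecs)) if j not in g]
        if not others:
            retained.append(g[0])
            continue
        best = min(g, key=lambda i: float(
            np.mean([vecs[i] @ o for o in others])))
        retained.append(best)
    return sorted(retained)
```

With two near-duplicate experts and one distinct expert, the duplicates collapse into one group and only two experts survive; knowledge distillation would then be applied during fine-tuning to recover any lost capability.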
📝 Abstract
By increasing model parameters while activating only a sparse subset of them for each task, the Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption of the growing number of experts presents a challenge to deploying these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, DeepSeek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.