Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

📅 2024-07-12
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
To address the high memory overhead and deployment challenges of Mixture-of-Experts (MoE) large language models arising from increasing numbers of experts, this work identifies, for the first time, knowledge redundancy among experts during pretraining. We propose a task-agnostic expert grouping and pruning framework: experts are clustered into groups based on representational similarity; a knowledge-diversity-driven pruning criterion is designed to retain complementary experts; and knowledge distillation is integrated to enhance post-pruning fine-tuning. The framework is architecture-agnostic, supporting diverse MoE models including Mixtral, DeepSeek-MoE, and Qwen. Experimental results demonstrate that our method achieves up to 30% memory compression while preserving natural language understanding and generation performance—substantially outperforming existing pruning approaches. This work establishes a general, scalable paradigm for efficient deployment of sparse MoE models.
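The grouping step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's exact criterion): experts are greedily clustered by cosine similarity of their flattened weight vectors, and one representative per group is retained while the rest are pruned as redundant. The function name, the similarity threshold, and the use of raw weights as the representation are all assumptions for illustration.

```python
import numpy as np

def group_and_prune_experts(expert_weights, sim_threshold=0.9):
    """Greedily group experts whose (flattened, normalized) weight
    vectors have cosine similarity >= sim_threshold with a group's
    representative; keep only the representative of each group.
    Hypothetical sketch of similarity-based expert pruning."""
    vecs = [w.ravel() / np.linalg.norm(w.ravel()) for w in expert_weights]
    kept, groups = [], []
    for i in range(len(vecs)):
        placed = False
        for g in groups:
            rep = g[0]  # first member acts as the group representative
            if float(vecs[i] @ vecs[rep]) >= sim_threshold:
                g.append(i)  # redundant: would be pruned
                placed = True
                break
        if not placed:
            groups.append([i])
            kept.append(i)  # new group; this expert is retained
    return kept, groups
```

In practice one would cluster experts by their activation statistics or routed-token representations rather than raw weights, and the paper additionally applies a diversity-driven retention criterion and distillation during fine-tuning; the sketch only shows the grouping idea.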

📝 Abstract
By increasing model parameters but activating them sparsely when performing a task, the use of Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Growing expert counts in MoE LLMs inflate memory consumption and hinder real-world deployment
Some experts encode redundant knowledge during pre-training, wasting parameters
Existing pruning methods leave room for improvement on natural language understanding and generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grouping and pruning similar experts
Improving parameter efficiency in MoE
Validated on Mixtral, Deepseek-MoE, Qwen