Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse Mixture-of-Experts (SMoE) models suffer from high static memory overhead and poor cross-domain generalization during deployment: existing post-training pruning methods rely on single-domain corpora, causing severe performance degradation on unseen domains and necessitating repeated pruning. This paper proposes a hierarchical functional pruning framework featuring a novel “cluster-then-select” paradigm. First, experts are functionally clustered based on activation similarity across multiple task domains; then, an activation variance score quantifies each expert’s representativeness, enabling selection of a functionally complementary subset. The method achieves significant static memory compression while substantially improving cross-domain adaptability. Experiments show average gains of 7.24% on general-purpose tasks and 8.92% on specialized tasks—including mathematical reasoning and code generation—outperforming state-of-the-art pruning approaches.

📝 Abstract
Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured “cluster-then-select” process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.
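The clustering step of the “cluster-then-select” process described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the similarity metric (cosine similarity here), the clustering algorithm (a simple farthest-point seeding), and the per-expert domain profiles are all assumptions. Each expert is summarized by its activation statistics on calibration batches from several task domains, and experts with similar cross-domain profiles are grouped into functional clusters.

```python
import numpy as np

def cluster_experts(profiles: np.ndarray, n_clusters: int) -> np.ndarray:
    """Group experts by cosine similarity of their cross-domain profiles.

    profiles: (n_experts, n_domains) activation statistics per expert
              (assumed representation, not specified by the paper).
    Returns an array of cluster ids, one per expert.
    """
    # Normalize rows so the inner product becomes cosine similarity.
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Farthest-point seeding: repeatedly pick the expert least similar to
    # all current seeds, then assign every expert to its most similar seed.
    seeds = [0]
    while len(seeds) < n_clusters:
        closest = sim[:, seeds].max(axis=1)
        seeds.append(int(np.argmin(closest)))
    return np.argmax(sim[:, seeds], axis=1)
```

In practice one would replace the toy seeding with a standard algorithm (e.g. hierarchical or k-medoids clustering) over the same similarity matrix; the point is only that clustering operates on cross-domain behavior rather than on a single corpus.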
Problem

Research questions and friction points this paper is trying to address.

Reducing static memory overhead in Mixture-of-Experts models deployment
Addressing catastrophic performance degradation when pruned models are transferred to unseen domains
Eliminating costly re-pruning requirements for each new task domain
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical clustering for expert selection
Activation Variability Score for representative experts
Functionally complementary pruning for multi-domain generalization