Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

📅 2025-09-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse Mixture-of-Experts (SMoE) models suffer from high memory overhead and deployment challenges because all expert parameters must be loaded even though only a few experts are activated per token. Existing compression methods focus primarily on expert-level pruning, neglecting neuron-level structural optimization, and fail to address semantic conflicts among experts that hinder direct merging. This paper proposes DERN, a retraining-free framework for expert pruning and neuron-level recomposition. First, redundant experts are pruned based on router statistics. Second, the pruned experts are decomposed into neuron-level segments, and each segment is reassigned to the most compatible retained expert. Finally, the segments within each retained expert are fused to build a compact model. DERN is the first method to reconstruct expert structure at neuron granularity, mitigating inter-expert semantic conflicts while remaining task-agnostic. At 50% expert sparsity, it improves average accuracy by over 5% on commonsense reasoning and MMLU benchmarks for Mixtral, Qwen, and DeepSeek SMoE models, while significantly reducing memory footprint and expert count.

📝 Abstract
Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
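The first step of the pipeline, pruning redundant experts from router statistics, can be illustrated with a minimal sketch. The paper does not specify the exact statistic here; this example assumes it is the top-1 routing frequency, i.e. how often the router selects each expert across a batch of tokens (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def prune_experts_by_router_stats(router_logits, keep_ratio=0.5):
    """Rank experts by how often the router selects them and keep the top fraction.

    router_logits: (num_tokens, num_experts) array of router scores.
    Returns the sorted indices of the retained experts.
    """
    top1 = router_logits.argmax(axis=1)                       # expert chosen per token
    counts = np.bincount(top1, minlength=router_logits.shape[1])
    n_keep = max(1, int(router_logits.shape[1] * keep_ratio))
    order = np.argsort(counts)[::-1]                          # most frequently routed first
    return np.sort(order[:n_keep])

# Toy example: 6 experts, with the router biased toward experts 0 and 3.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 6))
logits[:, [0, 3]] += 2.0
kept = prune_experts_by_router_stats(logits, keep_ratio=0.5)  # keeps 3 of 6 experts
```

With 50% expert sparsity as in the paper's experiments, `keep_ratio=0.5` drops half the experts; the heavily routed experts 0 and 3 survive the cut.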
Problem

Research questions and friction points this paper is trying to address.

Reducing high memory usage in sparse Mixture-of-Experts LLMs
Addressing neuron-level semantic conflicts in expert merging
Enabling retraining-free expert pruning for efficient deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes redundant experts using router statistics
Decomposes experts into neuron-level segments
Merges segments into compact expert representations
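The decompose-and-merge steps above can be sketched at the neuron level. This is a simplified illustration, not the paper's implementation: it treats each weight row of a pruned expert as a segment, uses cosine similarity as the compatibility metric (an assumption; the paper's metric may differ), and merges each segment into the nearest neuron of its most compatible retained expert by simple averaging:

```python
import numpy as np

def recombine(retained_experts, pruned_expert):
    """Reassign each neuron (weight row) of a pruned expert to its most
    compatible retained expert, then merge it into that expert's nearest
    neuron. Compatibility = cosine similarity (illustrative choice)."""
    for neuron in pruned_expert:
        best_e, best_n, best_s = 0, 0, -np.inf
        for e, W in enumerate(retained_experts):
            # cosine similarity between the segment and every neuron of expert e
            sims = W @ neuron / (np.linalg.norm(W, axis=1) * np.linalg.norm(neuron) + 1e-8)
            n = int(sims.argmax())
            if sims[n] > best_s:
                best_e, best_n, best_s = e, n, sims[n]
        # fuse the segment into the closest neuron (plain average here;
        # the paper may use a weighted scheme)
        retained_experts[best_e][best_n] = 0.5 * (retained_experts[best_e][best_n] + neuron)
    return retained_experts

# Toy example: one retained expert with 2 neurons of width 3, one pruned neuron.
retained = [np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])]
pruned = np.array([[1.0, 0.0, 0.0]])
merged = recombine(retained, pruned)
```

Because segments are absorbed into existing neurons rather than appended, the retained experts keep their original shape, which is what makes the result a compact model with fewer experts and no extra parameters.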