Unveiling Super Experts in Mixture-of-Experts Large Language Models

πŸ“… 2025-07-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing research on sparse Mixture-of-Experts (MoE) large language models lacks a systematic understanding of expert heterogeneity and the functional mechanisms of critical experts. Method: This work introduces the novel concept of β€œSuper Experts (SEs)”—a small, highly influential subset of experts that disproportionately govern model inference performance, especially on mathematical reasoning tasks. We identify SEs through expert pruning, activation pattern analysis, attention distribution probing, and a multi-task evaluation framework. Contribution/Results: We find that SEs exhibit abnormally high activation frequencies and occupy central roles in the attention sink mechanism. Empirical results show that removing even a few SEs causes significant performance degradation across tasks. This study provides new interpretability insights, principled foundations for MoE model compression, and actionable guidance for architecture design. All code is publicly released.

πŸ“ Abstract
Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
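The abstract attributes SEs to rare but extreme activation outliers in the output of the down_proj. A minimal sketch of what outlier-based detection could look like; the function name, the ratio threshold, and the toy calibration setup are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def find_super_experts(down_proj_outputs, ratio=10.0):
    """Flag experts whose down_proj outputs contain extreme activation
    outliers. A hypothetical proxy for the paper's criterion: an expert
    is "super" if its peak magnitude dwarfs the typical expert's peak.

    down_proj_outputs: dict of expert id -> array of down_proj output
    values collected on calibration data.
    """
    peaks = {e: float(np.max(np.abs(a))) for e, a in down_proj_outputs.items()}
    median_peak = float(np.median(list(peaks.values())))
    return sorted(e for e, p in peaks.items() if p > ratio * median_peak)

# Toy calibration pass: expert 7 emits one massive activation spike,
# mimicking the "rare but extreme" outliers described in the abstract.
rng = np.random.default_rng(0)
acts = {e: rng.normal(0.0, 1.0, 1024) for e in range(8)}
acts[7][0] = 500.0
print(find_super_experts(acts))  # expert 7 stands out
```

In practice such statistics would be collected per layer with forward hooks on each expert's down_proj module; the ratio test here is only one plausible way to formalize "rare but extreme".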
Problem

Research questions and friction points this paper is trying to address.

Identify critical Super Experts in MoE LLMs
Analyze impact of pruning Super Experts on performance
Understand role of Super Experts in attention mechanisms
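The attention point above refers to attention sinks: the disproportionate attention mass many LLMs place on the first token, which the abstract says SE pruning disrupts. A toy probe for sink mass; the helper names and the simplification of ignoring the causal mask are assumptions:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sink_mass(attn_scores):
    """Average attention probability assigned to the first (sink) token.

    attn_scores: (queries, keys) pre-softmax scores for one head.
    Simplification: the causal mask is ignored in this toy probe.
    """
    return float(softmax(attn_scores)[:, 0].mean())

# Scores biased toward the first key mimic an attention sink;
# flat scores spread mass uniformly over the 4 keys.
sinky = np.zeros((4, 4)); sinky[:, 0] = 5.0
flat = np.zeros((4, 4))
print(sink_mass(sinky), sink_mass(flat))
```

Comparing this statistic before and after pruning candidate experts is one simple way to test the claimed link between SEs and attention sinks.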
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies Super Experts via activation outliers
Prunes Super Experts to assess performance impact
Links Super Experts to attention sinks mechanism
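Expert pruning, as used in the ablations above, can be approximated by masking pruned experts out of the router's top-k selection. A simplified sketch; real MoE routers also renormalize gate weights, and `route` and the toy logits are illustrative:

```python
import numpy as np

def route(logits, pruned=(), top_k=2):
    """Top-k expert routing with optional expert pruning.

    logits: (tokens, experts) router scores.
    pruned: expert ids removed from the pool, as in SE-pruning ablations.
    Returns indices of the top_k selected experts per token.
    """
    masked = logits.copy()
    masked[:, list(pruned)] = -np.inf  # pruned experts can never be chosen
    # Sort descending and keep the top_k surviving experts per token.
    return np.argsort(-masked, axis=1)[:, :top_k]

logits = np.array([[0.1, 2.0, 0.5, 1.5],
                   [1.2, 0.3, 2.5, 0.4]])
print(route(logits))                  # normal top-2 routing
print(route(logits, pruned=(1, 2)))   # traffic reroutes around pruned experts
```

This makes the ablation's mechanism concrete: pruning an SE does not just remove parameters, it forcibly redirects every token that would have selected that expert.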
πŸ”Ž Similar Papers
No similar papers found.
Zunhai Su
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Qingyuan Li
Meituan
Hao Zhang
Meituan
YuLei Qian
Meituan
Yuchen Xie
Meituan
Kehong Yuan
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China