SD-MoE: Spectral Decomposition for Effective Expert Specialization

📅 2026-02-13

📈 Citations: 0

✨ Influential: 0

📝 Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting effective capacity and model performance. In this work, we analyze the parameter and gradient spaces from a spectral perspective and find that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.
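
To make finding (1) concrete, the sketch below shows one possible way to quantify how much two experts' dominant spectral components overlap. It is an illustrative diagnostic, not the SD-MoE method or the paper's measurement protocol: the function names (top_left_subspace, subspace_overlap), the choice of left singular vectors, the rank k, and the synthetic "expert" matrices are all assumptions made for this example.

```python
import numpy as np

def top_left_subspace(weight, k):
    """Orthonormal basis (d_out x k) of the top-k left singular subspace."""
    u, _, _ = np.linalg.svd(weight, full_matrices=False)
    return u[:, :k]

def subspace_overlap(w_a, w_b, k):
    """Mean squared cosine of the principal angles between the top-k
    left singular subspaces of two matrices; the value lies in [0, 1]."""
    u_a = top_left_subspace(w_a, k)
    u_b = top_left_subspace(w_b, k)
    # Singular values of u_a^T u_b are the cosines of the principal angles.
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 32, 4  # hypothetical sizes chosen for the demo

# Synthetic "experts" (assumption): a strong shared low-rank component
# plus small expert-specific noise.
shared = rng.standard_normal((d_out, k)) @ rng.standard_normal((k, d_in))
expert_a = 5.0 * shared + 0.1 * rng.standard_normal((d_out, d_in))
expert_b = 5.0 * shared + 0.1 * rng.standard_normal((d_out, d_in))

# Baseline: independent random matrices with no shared structure.
indep_a = rng.standard_normal((d_out, d_in))
indep_b = rng.standard_normal((d_out, d_in))

overlap_shared = subspace_overlap(expert_a, expert_b, k)
overlap_indep = subspace_overlap(indep_a, indep_b, k)
print(f"experts with shared component: {overlap_shared:.3f}")
print(f"independent experts:           {overlap_indep:.3f}")
```

An overlap near 1 means the two matrices spend their dominant directions on the same subspace, which is the kind of redundancy the abstract describes; a value near k / d_out is what unrelated random subspaces would give on average.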

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

expert specialization

spectral decomposition

parameter overlap

gradient alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Decomposition

Mixture-of-Experts

Expert Specialization