Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

📅 2026-02-15
🤖 AI Summary
This work addresses the low capacity utilization in sparse Mixture-of-Experts (MoE) models caused by expert overlap and routing ambiguity. The authors propose two plug-and-play regularization losses: an intra-layer specialization loss based on the cosine similarity of SwiGLU activations to encourage expert decorrelation within each layer, and an inter-layer coupling loss that maximizes the joint probability of top-k routing decisions across adjacent layers to promote routing consistency. Notably, these mechanisms are introduced without modifying the model architecture or router design. The approach is compatible with both DeepSeekMoE and standard top-k MoE architectures and is integrated into Megatron-LM. Experiments demonstrate consistent performance gains across pretraining, fine-tuning, and zero-shot evaluation, achieving higher expert specialization, lower routing entropy, more stable expert paths, and consequently faster inference.
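The intra-layer specialization loss described above can be illustrated with a minimal sketch: average the pairwise cosine similarity between the experts' activation vectors for the same token, so that minimizing the loss pushes experts toward decorrelated (complementary) representations. The function name and plain-list representation here are illustrative, not the paper's Megatron-LM implementation, which operates on SwiGLU activations inside the MoE layer.

```python
import math

def cosine(u, v):
    # Cosine similarity between two activation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_layer_specialization_loss(expert_acts):
    """Mean pairwise cosine similarity of the experts' activations
    on one token; driving this toward zero decorrelates the experts."""
    n = len(expert_acts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sims = [cosine(expert_acts[i], expert_acts[j]) for i, j in pairs]
    return sum(sims) / len(pairs)
```

For example, two experts producing parallel activations yield a loss of 1.0, while orthogonal activations yield 0.0, which is the specialization the regularizer rewards.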

📝 Abstract
Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap -- redundant representations across experts -- and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
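One plausible reading of the cross-layer coupling loss is sketched below: for a token, take the Top-$k$ routing probability mass in two adjacent layers, treat the layers as independent so the joint mass is their product, and minimize its negative log. This is a simplified illustration under that independence assumption; the paper's actual formulation of the joint Top-$k$ probability may differ, and the function names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_layer_coupling_loss(logits_layer, logits_next_layer, k=2):
    """Negative log of the joint Top-k routing mass of two adjacent
    layers for one token, assuming independence across layers.
    Minimizing it sharpens routing and couples the layers' decisions."""
    top_l = sorted(softmax(logits_layer), reverse=True)[:k]
    top_n = sorted(softmax(logits_next_layer), reverse=True)[:k]
    joint = sum(top_l) * sum(top_n)  # independence assumption (illustrative)
    return -math.log(joint)
```

With uniform router logits over 4 experts and $k=2$, each layer's Top-$k$ mass is 0.5, so the loss is $-\log(0.25) = \log 4$; peaked, consistent routing drives the loss toward zero, matching the lower-entropy routing the paper reports.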
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert overlap
routing ambiguity
model capacity underutilization
expert specialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert specialization
regularization loss
cross-layer coupling
routing efficiency
Rizhen Hu
Peking University, Beijing, China
Yuan Cao
Peking University, Beijing, China
Boao Kong
Peking University, Beijing, China
Mou Sun
Zhejiang Lab, Zhejiang, China
Kun Yuan
Center for Machine Learning Research, Peking University
distributed signal processing, large-scale optimization, machine learning