Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection

📅 2024-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address token distribution imbalance and expert homogenization in Mixture-of-Experts (MoE) models—which constrain semantic generalization—this paper proposes a bidirectional expert-token resonance dynamic routing framework. Our method introduces: (1) a novel bidirectional resonance mechanism enabling fine-grained semantic alignment between tokens and experts; (2) adaptive lower-bound capacity control guided by dynamic token distribution analysis; (3) a joint optimization loss for orthogonal feature disentanglement and expert specialization; and (4) communication-aware local expert coordination scheduling. The approach is lightweight and efficient: it reduces per-expert token processing by 40%, accelerates training by 5.4%–46.6%, and improves performance by 9.7%–14.1% on GDAD, GPQA, and TeleQnA after supervised fine-tuning—without compromising convergence stability or model efficacy.

📝 Abstract
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures grapple with challenges of token distribution imbalance and expert homogenization, impeding optimal semantic generalization. We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a module that determines the lower bounds of expert capacity based on dynamic token distribution analysis, specifically designed to mitigate the drawbacks of drop-and-pad strategies. The framework is also integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. Together, these components reduce expert homogeneity while enhancing the performance of the expert selection module. Additionally, we introduce a local expert strategy that simultaneously improves load balancing and reduces network communication overhead. It achieves a 40% reduction in tokens processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, training efficiency improvements of 5.4% to 46.6% are observed. After supervised fine-tuning, the model exhibits performance gains of 9.7% to 14.1% across the GDAD, GPQA, and TeleQnA benchmarks.
Problem

Research questions and friction points this paper is trying to address.

token distribution imbalance
expert homogenization
computational efficiency in MoE architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight efficient routing mechanism
Adaptive bidirectional expert-token selection
Dynamic expert capacity lower bounds
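The bidirectional selection and capacity lower bounds listed above can be illustrated with a minimal sketch. The paper does not publish reference code, so everything below is an assumption: the function name `bidirectional_route`, the geometric-mean combination of the two affinity directions, and the simple top-up rule for the capacity floor are all hypothetical stand-ins for the paper's actual mechanism.

```python
import numpy as np

def bidirectional_route(tokens, expert_w, top_k=2, min_capacity_frac=0.1):
    """Sketch of bidirectional (token <-> expert) routing with a capacity floor.

    tokens:   (n_tokens, d) token representations
    expert_w: (n_experts, d) learned expert embeddings (router weights)
    Returns a boolean assignment matrix of shape (n_tokens, n_experts).
    """
    logits = tokens @ expert_w.T                      # (n_tokens, n_experts)

    # Token -> expert affinity: softmax over experts for each token.
    t2e = np.exp(logits - logits.max(axis=1, keepdims=True))
    t2e /= t2e.sum(axis=1, keepdims=True)

    # Expert -> token affinity: softmax over tokens for each expert.
    e2t = np.exp(logits - logits.max(axis=0, keepdims=True))
    e2t /= e2t.sum(axis=0, keepdims=True)

    # "Resonance" score: geometric mean of the two directions (assumed form).
    score = np.sqrt(t2e * e2t)

    # Each token picks its top-k experts by resonance score.
    n_tokens, n_experts = score.shape
    assign = np.zeros_like(score, dtype=bool)
    topk = np.argsort(-score, axis=1)[:, :top_k]
    assign[np.arange(n_tokens)[:, None], topk] = True

    # Capacity lower bound: every expert receives at least min_cap tokens,
    # topped up from its highest-scoring not-yet-assigned tokens.
    min_cap = max(1, int(min_capacity_frac * n_tokens))
    for e in range(n_experts):
        deficit = min_cap - int(assign[:, e].sum())
        if deficit > 0:
            candidates = np.argsort(-score[:, e])
            extra = [t for t in candidates if not assign[t, e]][:deficit]
            assign[extra, e] = True
    return assign
```

The two softmaxes run over opposite axes of the same logit matrix, so a token-expert pair scores highly only when the token prefers the expert and the expert prefers the token, which is the "resonance" intuition; the floor loop then guarantees no expert is starved of tokens.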
Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Binfan Zheng, Yi Lin, Rongqian Zhao, Xin Chen
Huawei Technologies Co., Ltd