🤖 AI Summary
To address token distribution imbalance and expert homogenization in Mixture-of-Experts (MoE) models—which constrain semantic generalization—this paper proposes a bidirectional expert-token resonance dynamic routing framework. Our method introduces: (1) a novel bidirectional resonance mechanism enabling fine-grained semantic alignment between tokens and experts; (2) adaptive lower-bound capacity control guided by dynamic token distribution analysis; (3) a joint optimization loss for orthogonal feature disentanglement and expert specialization; and (4) communication-aware local expert coordination scheduling. The approach is lightweight and efficient: it reduces per-expert token processing by 40%, accelerates training by 5.4%–46.6%, and improves performance by 9.7%–14.1% on GDAD, GPQA, and TeleQnA after supervised fine-tuning—without compromising convergence stability or model efficacy.
📝 Abstract
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures grapple with token distribution imbalance and expert homogenization, which impede optimal semantic generalization. We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a module that determines lower bounds on expert capacity from dynamic token distribution analysis, specifically designed to mitigate drop-and-pad strategies. The framework is further integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. Together, these components effectively reduce expert homogeneity while enhancing the performance of the expert selection module. Additionally, we introduce a local expert strategy that simultaneously improves load balancing and reduces network communication overhead, achieving a 40% reduction in tokens processed by each expert without compromising model convergence or efficacy. Coupled with communication optimizations, this yields training efficiency improvements of 5.4% to 46.6%. After supervised fine-tuning, the model exhibits performance gains of 9.7% to 14.1% across the GDAD, GPQA, and TeleQnA benchmarks.
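To make the "bidirectional selection" idea concrete, here is a minimal NumPy sketch of one way a resonance-style dispatch could work: a token is routed to an expert only when the token picks that expert among its top-k choices *and* the expert picks that token among its top-k tokens. The function name, tensor shapes, and the simple intersection rule are illustrative assumptions for exposition, not the paper's exact routing algorithm; the expert-side top-k also acts as a crude per-expert capacity cap.

```python
import numpy as np

def bidirectional_route(tokens, expert_w, k_token=2, k_expert=4):
    """Hypothetical sketch of bidirectional token-expert selection.

    tokens:   (T, d) token hidden states
    expert_w: (E, d) one learned gating vector per expert
    Returns a boolean (T, E) dispatch mask: token t is sent to expert e
    only if t ranks e among its top-k_token experts AND e ranks t among
    its top-k_expert tokens (the "resonance" intersection).
    """
    scores = tokens @ expert_w.T                      # (T, E) affinities

    # Token-choice: each token keeps its k_token highest-scoring experts.
    tok_top = np.argsort(-scores, axis=1)[:, :k_token]
    tok_mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(tok_mask, tok_top, True, axis=1)

    # Expert-choice: each expert keeps its k_expert highest-scoring tokens,
    # which also bounds per-expert load (no drop-and-pad needed).
    exp_top = np.argsort(-scores, axis=0)[:k_expert, :]
    exp_mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(exp_mask, exp_top, True, axis=0)

    return tok_mask & exp_mask

rng = np.random.default_rng(0)
mask = bidirectional_route(rng.normal(size=(16, 8)), rng.normal(size=(4, 8)))
print(mask.sum(axis=0))  # tokens dispatched per expert, each capped at k_expert
```

Because the intersection is stricter than token-choice alone, some tokens may reach fewer than k_token experts, which is one intuition for the reported reduction in tokens processed per expert.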