🤖 AI Summary
This work addresses a fundamental trade-off in sparse Mixture-of-Experts (MoE) models: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. The authors propose Hi-MoE, a grouped MoE framework that decomposes routing into two coupled hierarchical levels: inter-group routing enforces balanced token traffic across expert groups, while intra-group routing promotes complementary expert specialization and prevents within-group collapse. The authors report consistent improvements over recent sparse-routing and grouped-MoE baselines across both NLP and vision benchmarks. In large-scale pretraining on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B.
📝 Abstract
Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
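The two-level decomposition described above can be made concrete with a small NumPy sketch. This is an illustration only, not the Hi-MoE method: the abstract does not specify the routing rule or the objectives, so the sketch assumes top-1 group selection, top-k expert selection within the chosen group, and a Switch-Transformer-style auxiliary term as the inter-group balance signal. The names `two_level_route`, `w_group`, and `w_expert` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def two_level_route(tokens, w_group, w_expert, top_k=2):
    """Route each token to one group, then to top_k experts inside that group.

    tokens:   (T, d) token representations
    w_group:  (d, G) inter-group router weights (assumed linear router)
    w_expert: (G, d, E) intra-group router weights, one matrix per group
    """
    num_tokens = tokens.shape[0]
    num_groups = w_group.shape[1]

    # Level 1: inter-group routing (assumption: top-1 group per token).
    group_probs = softmax(tokens @ w_group)            # (T, G)
    group_idx = group_probs.argmax(axis=-1)            # (T,)

    # Level 2: intra-group routing using the chosen group's router.
    chosen = w_expert[group_idx]                       # (T, d, E)
    expert_probs = softmax(np.einsum("td,tde->te", tokens, chosen))
    expert_idx = np.argsort(-expert_probs, axis=-1)[:, :top_k]   # (T, k)
    gates = np.take_along_axis(expert_probs, expert_idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)  # renormalise over top_k

    # Assumed balance signal: G * sum_g (dispatch fraction * mean router prob).
    dispatch_frac = np.bincount(group_idx, minlength=num_groups) / num_tokens
    mean_prob = group_probs.mean(axis=0)
    balance_loss = num_groups * float(dispatch_frac @ mean_prob)

    return group_idx, expert_idx, gates, balance_loss


# Usage with random inputs: 16 tokens, width 8, 4 groups of 4 experts each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
w_group = rng.normal(size=(8, 4))
w_expert = rng.normal(size=(4, 8, 4))
group_idx, expert_idx, gates, balance_loss = two_level_route(
    tokens, w_group, w_expert, top_k=2
)
```

In this sketch the balance term is computed only at the group level, mirroring the idea that fairness is enforced across groups while experts inside a group are free to differentiate. How Hi-MoE actually couples the two levels, and what intra-group specialization objective it uses, is not stated in the abstract.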