Hierarchical Mixture-of-Experts with Two-Stage Optimization

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a fundamental trade-off in sparse Mixture-of-Experts (MoE) routers: strong load balancing can suppress expert specialization, while aggressively promoting expert diversity often causes routing collapse. The authors propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled hierarchical levels. Inter-group balancing enforces fair token traffic across expert groups, while intra-group specialization promotes complementary expert behaviors and prevents collapse within a group. The authors report consistent improvements over recent sparse-routing and grouped-MoE baselines on both NLP and vision benchmarks. In a 58B-token pretraining setting, Hi-MoE-7B achieves 5.6% lower perplexity and a 40% improvement in expert balance relative to OLMoE-7B.
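To make the load-balancing side of this trade-off concrete, here is a minimal sketch of the widely used Switch Transformer style auxiliary loss, N · Σ f_i · P_i, for top-1 routing. It is background for a flat (non-hierarchical) router, not the Hi-MoE objective, which the summary does not specify. The function names and toy logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(router_logits):
    """Switch-style auxiliary loss for top-1 routing (illustrative only).

    router_logits: one list of per-expert logits for each token.
    Returns N * sum_i f_i * P_i, where f_i is the fraction of tokens whose
    top-1 expert is i and P_i is the mean router probability of expert i.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: hard assignment fractions (top-1 expert per token).
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P_i: average soft router probability per expert.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))


# Toy inputs (hypothetical): two experts, four tokens.
balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0]] * 4  # every token prefers expert 0

print(load_balance_loss(balanced))
print(load_balance_loss(collapsed))
```

With two experts, the evenly split batch gives a loss of 1.0, while the collapsed batch gives a value close to 2. Minimizing this term alone pushes the router toward uniform traffic regardless of what each expert has learned, which is the pressure against specialization that the summary describes.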

📝 Abstract

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
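The two-level decomposition described in the abstract can be sketched as a forward routing step: first choose an expert group, then choose experts inside that group. This is a hedged illustration, not the authors' implementation. The abstract does not state how gates are computed or how the two objectives are formulated, so the top-1 group choice, the gate formula (group probability times renormalized within-group probability), and every name below are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_route(group_logits, expert_logits, k=1):
    """Route one token through a two-level router (assumed design).

    group_logits: one logit per expert group (inter-group level).
    expert_logits: expert_logits[g] holds logits for group g's experts
        (intra-group level).
    Returns (group index, [(expert index within group, gate weight), ...]).
    """
    # Level 1: pick the most probable group.
    group_probs = softmax(group_logits)
    g = group_probs.index(max(group_probs))

    # Level 2: pick the top-k experts inside the chosen group only.
    local_probs = softmax(expert_logits[g])
    ranked = sorted(
        range(len(local_probs)), key=lambda i: local_probs[i], reverse=True
    )[:k]

    # Assumed gating: renormalize over the selected experts, then scale by
    # the group probability so the gates sum to group_probs[g].
    norm = sum(local_probs[i] for i in ranked)
    gates = [(i, group_probs[g] * local_probs[i] / norm) for i in ranked]
    return g, gates


# Toy token (hypothetical logits): two groups with 2 and 3 experts.
group_logits = [0.0, 2.0]
expert_logits = [[1.0, 0.0], [0.0, 3.0, 1.0]]
group, gates = hierarchical_route(group_logits, expert_logits, k=2)
print(group, gates)
```

Separating the two decisions is what allows different objectives at each level: a balancing term can act on the group-level probabilities alone, while a specialization term acts on the within-group probabilities, so neither directly overrides the other.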

Problem

Research questions and friction points this paper is trying to address.

Sparse Mixture-of-Experts

routing collapse

expert specialization

load balancing

hierarchical routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Mixture-of-Experts

Two-Stage Optimization

Inter-Group Balancing