AI Summary
Current sparse Mixture-of-Experts (sMoE) routing mechanisms rely on similarity-based scoring, which struggles to capture the intrinsic structure of the input data, leading to a fundamental trade-off between expert specialization and computational load balancing that limits model scalability and performance. This paper proposes a decoupled probabilistic routing framework: it explicitly models input-space partitioning via an independently trained probabilistic mixture model, disentangling routing decisions from downstream task optimization, and it incorporates a dynamic sparsity activation mechanism to enable domain-aware expert selection. The approach improves both the clarity of expert specialization and the balance of load across experts. Empirical evaluation demonstrates consistent gains over state-of-the-art sMoE baselines on multiple vision-language tasks, improving predictive performance and expert utilization efficiency simultaneously.
Abstract
Sparse Mixture of Experts (sMoE) has become a pivotal approach for scaling large vision-language models, offering substantial capacity while maintaining computational efficiency through dynamic, sparse activation of experts. However, existing routing mechanisms, typically based on similarity scoring, struggle to effectively capture the underlying input structure. This limitation leads to a trade-off between expert specialization and balanced computation, hindering both scalability and performance. We propose Input Domain Aware MoE, a novel routing framework that leverages a probabilistic mixture model to better partition the input space. By modeling routing probabilities as a mixture of distributions, our method enables experts to develop clear specialization boundaries while achieving balanced utilization. Unlike conventional approaches, our routing mechanism is trained independently of task-specific objectives, allowing for stable optimization and decisive expert assignments. Empirical results on vision-language tasks demonstrate that our method consistently outperforms existing sMoE approaches, achieving higher task performance and improved expert utilization balance.
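To make the routing idea concrete, here is a minimal sketch of mixture-model-based token routing with top-k sparsity. The isotropic Gaussians with a shared fixed variance, the function name `gmm_router`, and the renormalized top-k weighting are illustrative assumptions, not the paper's exact formulation; in the paper the mixture model is trained independently of the task loss, whereas here the component means and priors are simply given as inputs.

```python
import numpy as np

def gmm_router(x, means, log_priors, top_k=2, var=1.0):
    """Route each token to its top-k experts by mixture-model responsibility.

    x          : (n, d) token embeddings
    means      : (E, d) component means, one per expert (assumed pre-trained
                 independently of the task objective)
    log_priors : (E,) log mixing weights
    Returns (idx, w): expert indices (n, top_k) and renormalized weights.
    """
    # log N(x | mu_e, var*I) + log pi_e, dropping terms constant across experts
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)  # (n, E)
    log_post = log_priors[None, :] - sq_dist / (2.0 * var)
    # softmax over experts -> posterior responsibilities (stable via max-shift)
    log_post -= log_post.max(axis=-1, keepdims=True)
    resp = np.exp(log_post)
    resp /= resp.sum(axis=-1, keepdims=True)
    # dynamic sparsity: keep only the top-k responsibilities, renormalize
    idx = np.argsort(-resp, axis=-1)[:, :top_k]
    w = np.take_along_axis(resp, idx, axis=-1)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w
```

Because responsibilities follow the mixture's partition of the input space rather than a learned similarity score, tokens near a component mean are routed decisively to that expert, which is the behavior the abstract attributes to clear specialization boundaries.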