AI Summary
Current sparse Mixture-of-Experts (sMoE) routing mechanisms rely on similarity-based scoring, which struggles to capture the intrinsic structure of the input data, leading to a fundamental trade-off between expert specialization and computational load balancing that limits model scalability and performance. This paper proposes a decoupled probabilistic routing framework: it explicitly models input-space partitioning via an independently trained probabilistic mixture model, disentangling routing decisions from downstream task optimization, and it incorporates a dynamic sparsity activation mechanism to enable domain-aware expert selection. The approach improves both the clarity of expert specialization and the balance of load across experts. Empirical evaluation demonstrates consistent gains over state-of-the-art sMoE baselines on multiple vision-language tasks, improving predictive performance and expert utilization efficiency simultaneously.
Abstract
Sparse Mixture of Experts (sMoE) has become a pivotal approach for scaling large vision-language models, offering substantial capacity while maintaining computational efficiency through dynamic, sparse activation of experts. However, existing routing mechanisms, typically based on similarity scoring, struggle to effectively capture the underlying input structure. This limitation leads to a trade-off between expert specialization and balanced computation, hindering both scalability and performance. We propose Input Domain Aware MoE, a novel routing framework that leverages a probabilistic mixture model to better partition the input space. By modeling routing probabilities as a mixture of distributions, our method enables experts to develop clear specialization boundaries while achieving balanced utilization. Unlike conventional approaches, our routing mechanism is trained independently of task-specific objectives, allowing for stable optimization and decisive expert assignments. Empirical results on vision-language tasks demonstrate that our method consistently outperforms existing sMoE approaches, achieving higher task performance and improved expert utilization balance.
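To make the routing idea concrete, here is a minimal sketch of mixture-model-based token routing with top-k sparsity. The isotropic Gaussians with a shared fixed variance, the function name `gmm_router`, and the renormalized top-k weighting are illustrative assumptions, not the paper's exact formulation; in the paper the mixture model is trained independently of the task loss, whereas here the component means and priors are simply given as inputs.

```python
import numpy as np

def gmm_router(x, means, log_priors, top_k=2, var=1.0):
    """Route each token to its top-k experts by mixture-model responsibility.

    x          : (n, d) token embeddings
    means      : (E, d) component means, one per expert (assumed pre-trained
                 independently of the task objective)
    log_priors : (E,) log mixing weights
    Returns (idx, w): expert indices (n, top_k) and renormalized weights.
    """
    # log N(x | mu_e, var*I) + log pi_e, dropping terms constant across experts
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)  # (n, E)
    log_post = log_priors[None, :] - sq_dist / (2.0 * var)
    # softmax over experts -> posterior responsibilities (stable via max-shift)
    log_post -= log_post.max(axis=-1, keepdims=True)
    resp = np.exp(log_post)
    resp /= resp.sum(axis=-1, keepdims=True)
    # dynamic sparsity: keep only the top-k responsibilities, renormalize
    idx = np.argsort(-resp, axis=-1)[:, :top_k]
    w = np.take_along_axis(resp, idx, axis=-1)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w
```

Because responsibilities follow the mixture's partition of the input space rather than a learned similarity score, tokens near a component mean are routed decisively to that expert, which is the behavior the abstract attributes to clear specialization boundaries.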