Guiding Mixture-of-Experts with Temporal Multimodal Interactions

📅 2025-09-29
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models employ routing mechanisms that neglect temporally evolving cross-modal interaction dynamics, resulting in insufficient expert specialization and constrained reasoning capabilities. To address this, we propose the first MoE routing framework explicitly guided by temporal multimodal interaction dynamics. Our approach introduces a multimodal interaction-aware router, incorporating a temporal interaction quantification mechanism to model dynamic inter-modal dependencies, and integrates a dynamic token allocation strategy for fine-grained, instance- and token-level routing. This design endows experts with generalizable interaction processing capacity, significantly enhancing semantic understanding and functional specialization. Evaluated on multiple mainstream multimodal benchmarks—including MM-Bench, MME, and SEED-Bench—our method achieves consistent performance gains over strong baselines. Moreover, it improves the interpretability and traceability of routing decisions by explicitly encoding interaction patterns into the routing process.

📝 Abstract
Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.
Problem

Research questions and friction points this paper is trying to address.

MoE routing overlooks time-varying multimodal interaction dynamics
Limitation hinders expert specialization and effective multimodal reasoning
Proposes guiding MoE routing using quantified temporal interaction patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic routing based on temporal multimodal interactions
Multimodal interaction-aware router for token dispatch
Experts learn generalizable interaction-processing skills
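
The routing idea listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how temporal interactions are quantified or how they enter the router, so the `interaction` vector, the additive way it is combined with the token embedding, and the names `route_token`, `w_token`, and `w_inter` are all assumptions made for illustration. The sketch only shows a standard top-k gate whose logits also depend on a per-token interaction feature vector.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, interaction, w_token, w_inter, top_k=2):
    """Pick top_k experts for one token and return [(expert_index, gate), ...].

    token:       token embedding (list of floats)
    interaction: hypothetical interaction features for this token at this
                 time step (placeholder for the paper's quantified
                 temporal interaction, which is not specified here)
    w_token:     one weight row per expert, applied to the token
    w_inter:     one weight row per expert, applied to the interaction features
    """
    logits = []
    for wt, wi in zip(w_token, w_inter):
        # Assumed combination: token score plus interaction score.
        logit = sum(a * b for a, b in zip(wt, token))
        logit += sum(a * b for a, b in zip(wi, interaction))
        logits.append(logit)

    # Keep the top_k highest-scoring experts.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize gate weights over the selected experts only.
    gates = softmax([logits[i] for i in chosen])
    return list(zip(chosen, gates))
```

In this sketch, two tokens with identical embeddings can still be sent to different experts if their interaction features differ, which is the behavior the interaction-aware router is described as providing.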
Xing Han
Johns Hopkins University, Baltimore, MD 21218, USA
Hsing-Huan Chung
The University of Texas at Austin
Machine Learning, Graph Machine Learning, Distribution Shift
Joydeep Ghosh
(Chaired) Professor, ECE Dept., Univ. Texas at Austin; Faculty Dell Med, UT-Comp. Sc., McCombs
Machine Learning, Data Mining, Ethical AI, Personalization, AI/ML for Healthcare
Paul Pu Liang
Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Suchi Saria
Johns Hopkins University, Baltimore, MD 21218, USA