MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the efficiency bottlenecks in multimodal mixture-of-experts (MoE) large language models caused by imbalanced computational loads and heterogeneous information between visual and textual modalities during inference. To this end, the authors propose MACS, a training-free inference framework that introduces, for the first time, an entropy-based semantic weighting mechanism for visual tokens and dynamically adjusts expert capacity according to the input modality composition. This enables semantics-aware resource allocation, effectively mitigating expert load imbalance in multimodal settings. Extensive experiments demonstrate that MACS significantly outperforms existing approaches across multiple benchmarks, substantially improving both inference efficiency and deployment robustness of MoE-based multimodal large models.
📝 Abstract
Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. The problem is worse in the multimodal setting, because existing token-count-based load balancing methods fail to address two challenges unique to it: (1) Information Heterogeneity, where numerous redundant visual tokens are treated the same as semantically critical ones, and (2) Modality Dynamics, where visual-to-text ratios that vary across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism that quantifies the semantic value of visual tokens, addressing information heterogeneity. In addition, a Dynamic Modality-Adaptive Capacity mechanism allocates expert resources according to the real-time modality composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.
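The abstract names the two mechanisms but gives no formulas, so the following is only a rough sketch of how an entropy-weighted load and a modality-adaptive capacity could be computed, not the authors' method. Everything specific here is an assumption made for illustration: the function names, the choice to weight a visual token by the normalized Shannon entropy of some per-token probability distribution, the convention that a text token counts as 1.0, and the `text_factor` / `visual_factor` values that are interpolated by the visual-token ratio.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability distribution, scaled to [0, 1].

    ASSUMPTION: the paper does not say which distribution the entropy is
    taken over, nor whether high entropy means more or less semantic value.
    """
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def weighted_expert_load(tokens, num_experts):
    """Accumulate a weighted load per expert instead of a raw token count.

    `tokens` is a list of (expert_id, is_visual, probs) tuples (hypothetical
    layout). Text tokens count as 1.0; visual tokens count as their
    normalized entropy, so redundant visual tokens add little load.
    """
    load = [0.0] * num_experts
    for expert_id, is_visual, probs in tokens:
        load[expert_id] += normalized_entropy(probs) if is_visual else 1.0
    return load


def adaptive_capacity(num_text, num_visual, num_experts,
                      text_factor=1.25, visual_factor=0.75):
    """Per-expert capacity that depends on the input's modality mix.

    ASSUMPTION: the capacity factor is a linear interpolation between two
    illustrative factors, weighted by the fraction of visual tokens.
    """
    total = num_text + num_visual
    if total == 0:
        return 0
    visual_ratio = num_visual / total
    factor = (1.0 - visual_ratio) * text_factor + visual_ratio * visual_factor
    return math.ceil(factor * total / num_experts)


if __name__ == "__main__":
    batch = [
        (0, False, []),          # text token routed to expert 0
        (0, True, [0.5, 0.5]),   # visual token, maximal entropy
        (1, True, [1.0, 0.0]),   # visual token, zero entropy
    ]
    print(weighted_expert_load(batch, num_experts=2))
    print(adaptive_capacity(num_text=8, num_visual=8, num_experts=4))
```

In this sketch a plain token count would report loads of 2 and 1 for the two experts, while the weighted version lets the zero-entropy visual token contribute nothing. This is the kind of distinction that token-count-based balancing cannot make.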
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Multimodal Large Language Models
Expert Parallelism
Load Balancing
Straggler Effect
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Aware Capacity Scaling
Mixture-of-Experts
Multimodal Large Language Models
Expert Parallelism
Entropy-Weighted Load