MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) suffer from low inference efficiency, and existing expert-skipping methods, designed for unimodal LLMs, degrade significantly on MLLMs because they neglect modality-specific token behavior and inter-layer expert heterogeneity. Method: This paper proposes MoDES, a training-free dynamic expert-skipping framework for MoE MLLMs. Its core components are: (1) Globally-Modulated Local Gating (GMLG), which integrates global layer-wise expert importance into local routing probabilities to estimate per-token expert importance; (2) Dual-Modality Thresholding (DMT), which applies separate skipping thresholds to visual and textual tokens; and (3) a frontier search algorithm that exploits monotonicity properties to find optimal thresholds efficiently, cutting tuning time from days to hours. Results: When skipping 88% of experts on Qwen3-VL-MoE-30B-A3B-Instruct, MoDES outperforms prior expert-skipping approaches by up to 10.67% (97.33% vs. 86.66%), while accelerating prefilling by 2.16× and decoding by 1.26×.

📝 Abstract
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.
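The GMLG + DMT pipeline described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the softmax gating form, the scalar `layer_importance` weight, and the keep-at-least-one fallback are all assumptions inferred from the abstract's description.

```python
import numpy as np

def modes_skip_mask(router_logits, layer_importance, tau_visual, tau_text, is_visual):
    """Hypothetical sketch of per-token expert skipping for one MoE layer.

    router_logits: (num_experts,) local routing logits for one token
    layer_importance: scalar global importance of this MoE layer (GMLG)
    tau_visual / tau_text: modality-specific skipping thresholds (DMT)
    Returns a boolean mask of experts to keep active.
    """
    # Local routing probabilities (stable softmax over expert logits)
    local_probs = np.exp(router_logits - router_logits.max())
    local_probs /= local_probs.sum()
    # GMLG: modulate local routing probabilities by global layer-wise importance
    importance = layer_importance * local_probs
    # DMT: pick the threshold for this token's modality
    tau = tau_visual if is_visual else tau_text
    keep = importance >= tau
    # Assumed fallback: always execute at least the top-scoring expert
    if not keep.any():
        keep[np.argmax(importance)] = True
    return keep
```

Under this sketch, a higher threshold for one modality skips more of that modality's experts; the abstract's 88% skipping rate would correspond to aggressive thresholds tuned per modality.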
Problem

Research questions and friction points this paper is trying to address.

Reducing computational inefficiency in Mixture-of-Experts Multimodal Large Language Models
Addressing performance degradation from existing expert skipping methods in MLLMs
Improving inference speed while maintaining accuracy through adaptive expert skipping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic expert skipping for efficient MoE MLLM inference
Globally-modulated local gating mechanism for expert importance
Dual-modality thresholding with frontier search optimization
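The monotonicity property the frontier search exploits (skipping more experts cannot improve quality) admits a simple one-dimensional illustration: under that assumption, the largest acceptable threshold can be found by bisection instead of an exhaustive sweep. This toy sketch is not the paper's algorithm, which searches a joint frontier over both modality thresholds; the function name, bounds, and iteration count are assumptions.

```python
def frontier_search(eval_quality, target, lo=0.0, hi=1.0, iters=20):
    """Hypothetical sketch: find the largest threshold tau with
    eval_quality(tau) >= target, assuming eval_quality is
    monotonically non-increasing in tau (more skipping, never better quality).
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_quality(mid) >= target:
            lo = mid  # quality still acceptable: skip more aggressively
        else:
            hi = mid  # quality dropped below target: back off
    return lo
```

Because each bisection step halves the search interval, the number of (expensive) quality evaluations grows logarithmically in the desired threshold precision, which is consistent with the abstract's claim of reducing tuning time from days to hours.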
Yushi Huang, Hong Kong University of Science and Technology (Efficient AI)
Zining Wang, Beihang University
Zhihang Yuan, Bytedance (Efficient AI, Model Compression, MLLM)
Yifu Ding, Beihang University
Ruihao Gong, Beihang University
Jinyang Guo, The University of Sydney (Deep Learning, Efficient Methods, Edge Computing)
Xianglong Liu, Beihang University
Jun Zhang, Hong Kong University of Science and Technology