FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference latency and computational overhead caused by long visual token sequences in multimodal large language models (MLLMs) under high-resolution inputs, this paper proposes a training-free acceleration framework tailored to Mixture-of-Experts (MoE) architectures. The method integrates dynamic expert activation reduction with routing-aware token pruning: it identifies redundant visual tokens via similarity in expert routing probability distributions and skips unnecessary expert computations. Evaluated on large MoE-MLLMs, including DeepSeek-VL2 and InternVL3.5, the framework reduces FLOPs by up to 55.0% while retaining 95.5% of the original task performance, outperforming baselines such as FastV and SparseVLM. The key contribution is rethinking token pruning from a routing-analysis perspective, departing from conventional dense-model pruning paradigms, and establishing a new pathway for efficient MLLM deployment in resource-constrained settings.

📝 Abstract
Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.
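The routing-aware pruning strategy described above can be sketched in a few lines. This is a minimal illustration only, assuming cosine similarity between per-token expert-routing distributions and a greedy farthest-point keep-set; the paper's actual similarity measure, seeding, and selection procedure may differ:

```python
import numpy as np

def routing_aware_prune(routing_probs, keep_ratio=0.5):
    """Greedy pruning sketch: keep tokens whose expert-routing
    distributions are least similar to already-kept tokens.

    routing_probs: (num_tokens, num_experts), rows sum to 1
                   (softmax of the MoE router logits).
    Returns the sorted indices of the tokens to keep.
    """
    n = routing_probs.shape[0]
    num_keep = max(1, int(n * keep_ratio))

    # Cosine similarity between routing distributions.
    unit = routing_probs / np.linalg.norm(routing_probs, axis=1, keepdims=True)
    sim = unit @ unit.T

    kept = [0]  # seed with the first token (placeholder choice)
    candidates = set(range(1, n))
    while len(kept) < num_keep:
        # Keep the candidate least similar to any already-kept token,
        # i.e. the most "novel" routing pattern; near-duplicates are pruned.
        best, best_score = None, np.inf
        for c in candidates:
            score = sim[c, kept].max()
            if score < best_score:
                best, best_score = c, score
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)
```

The intuition is that two visual tokens routed to the experts with nearly identical probability distributions are likely redundant for the MoE layers, so one of them can be dropped without changing which experts do the work.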
Problem

Research questions and friction points this paper is trying to address.

Reducing redundant visual tokens in multimodal large language models
Accelerating mixture-of-experts MLLMs without retraining
Minimizing computational costs while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic expert activation reduces unnecessary computation
Routing-aware token pruning removes redundant visual tokens
Training-free acceleration framework for multimodal mixture-of-experts
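The expert-activation-reduction idea can be sketched as follows: visual tokens are dispatched to fewer experts than text tokens. The function name and the `k_text`/`k_visual` values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def reduced_expert_dispatch(router_logits, is_visual, k_text=2, k_visual=1):
    """For each token, select the top-k experts by routing probability,
    using a smaller k for visual tokens so they trigger fewer
    expert forward passes (illustrative k values)."""
    probs = softmax(router_logits)
    assignments = []
    for p, visual in zip(probs, is_visual):
        k = k_visual if visual else k_text
        top = np.argsort(p)[::-1][:k]  # indices of the k largest probs
        assignments.append(top.tolist())
    return assignments
```

Because each activated expert costs one FFN forward pass per token, halving the number of experts a visual token activates roughly halves the MoE FLOPs attributable to that token, which is why this composes naturally with token pruning.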
Authors

Guoyang Xia — Beijing University of Posts and Telecommunications
Yifeng Ding — University of Illinois at Urbana-Champaign (Software Engineering, Generative Models)
Fengfa Li — Li Auto
Lei Ren — Li Auto (NLP, LLM, VLM)
Wei Chen — Li Auto
Fangxiang Feng — Beijing University of Posts and Telecommunications (Multimodal Learning, Image Synthesis)
Xiaojie Wang — Beijing University of Posts and Telecommunications