🤖 AI Summary
To address the high inference latency and computational overhead incurred by long visual token sequences in multimodal large language models (MLLMs) under high-resolution visual inputs, this paper proposes a training-free acceleration framework tailored for Mixture-of-Experts (MoE) architectures. The method integrates dynamic expert activation reduction with routing-aware token pruning: it identifies redundant visual tokens based on similarity in expert routing probability distributions and skips unnecessary expert computations. Evaluated on large MoE-MLLMs—including DeepSeek-VL2 and InternVL3.5—the framework achieves up to 55.0% FLOPs reduction while retaining approximately 95.5% of original task performance, outperforming baselines such as FastV and SparseVLM. The key contribution lies in rethinking token pruning from a routing analysis perspective—departing from conventional dense-model pruning paradigms—and establishing a novel pathway for efficient MLLM deployment in resource-constrained settings.
📝 Abstract
Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.
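The routing-aware pruning idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function name, the cosine-similarity criterion, the greedy first-come selection, and the threshold/budget values are all assumptions for exposition. The input is the router's softmax output over experts for each visual token; tokens whose routing distributions closely match an already-kept token are treated as redundant.

```python
import numpy as np

def routing_aware_prune(routing_probs, keep_ratio=0.45, sim_threshold=0.95):
    """Greedy sketch of routing-aware visual token pruning (illustrative only).

    routing_probs: (n_tokens, n_experts) array of router softmax outputs
    for the visual tokens. Returns indices of tokens to keep.
    """
    n = routing_probs.shape[0]
    # Normalize rows so a dot product gives cosine similarity
    # between routing probability distributions.
    norms = np.linalg.norm(routing_probs, axis=1, keepdims=True)
    unit = routing_probs / np.clip(norms, 1e-12, None)

    keep = []
    for i in range(n):
        # Drop token i if its routing distribution is a near-duplicate
        # of some token we already decided to keep.
        if keep and np.max(unit[keep] @ unit[i]) > sim_threshold:
            continue
        keep.append(i)

    # Enforce an overall retention budget (keep the earliest survivors).
    budget = max(1, int(keep_ratio * n))
    return keep[:budget]
```

For example, two tokens routed almost identically (e.g., both sending ~90% of their probability mass to the same expert) collapse to one kept token, while a token with a distinct routing pattern survives; the actual FastMMoE criterion and its interaction with expert activation reduction are specified in the paper.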