Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference latency and computational overhead in multimodal large language models (MLLMs) caused by generating massive visual tokens from high-resolution images, this paper proposes a training-free, parallel visual token scheduling framework. The method semantically categorizes visual tokens into "subject" and "non-subject" groups and introduces a parallel semantic fusion mechanism that dynamically prunes non-subject token paths during inference, enabling efficient token compression while preserving critical contextual information. Crucially, it requires no additional parameters or heuristic rules and is architecture-agnostic, compatible with mainstream MLLMs. Experiments demonstrate that the approach achieves up to 88.9% visual token pruning, yielding a 1.77× inference speedup and a 70% reduction in FLOPs, all without sacrificing task accuracy.

๐Ÿ“ Abstract
Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length, and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
Problem

Research questions and friction points this paper is trying to address.

Reducing multimodal LLM inference latency from excessive visual tokens
Pruning visual tokens without losing essential contextual information
Accelerating computation while maintaining model accuracy across architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitions visual tokens into subject and non-subject groups
Processes token groups in parallel for semantic transfer
Discards non-subject path mid-inference to reduce computation
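The three steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`partition_tokens`, `schedule`), the use of question-attention scores to pick subject tokens, the `keep_ratio` of 0.111 (matching the reported 88.9% pruning), and the fixed `drop_layer` are all assumptions for the sketch.

```python
import numpy as np

def partition_tokens(attn_scores, keep_ratio=0.111):
    """Split visual token indices into subject (high question-attention)
    and non-subject groups. attn_scores: shape (num_visual_tokens,).
    keep_ratio=0.111 mirrors the paper's up-to-88.9% pruning (assumption)."""
    k = max(1, int(round(len(attn_scores) * keep_ratio)))
    order = np.argsort(attn_scores)[::-1]  # highest attention first
    return order[:k], order[k:]

def schedule(visual_tokens, attn_scores, num_layers, drop_layer, layer_fn):
    """Run both groups in parallel for the first `drop_layer` layers so
    non-subject semantics can transfer into the question tokens, then
    discard the non-subject path to cut the remaining computation.
    `layer_fn(tokens, layer_idx)` stands in for one transformer layer."""
    subj_idx, nonsubj_idx = partition_tokens(attn_scores)
    subj = visual_tokens[subj_idx]
    nonsubj = visual_tokens[nonsubj_idx]
    for layer in range(num_layers):
        subj = layer_fn(subj, layer)
        if layer < drop_layer:
            nonsubj = layer_fn(nonsubj, layer)  # parallel non-subject path
        else:
            nonsubj = None  # prune the non-subject path mid-inference
    return subj
```

With a toy `layer_fn` of `lambda x, l: x + 1`, nine tokens, and the default ratio, only the single highest-attention token survives and is processed through all layers, while the other eight stop after `drop_layer` steps.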