DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large multimodal models (LMMs) suffer from high inference latency and excessive GPU memory consumption due to redundant visual tokens. Method: This paper proposes a fine-tuning-free, diversity-driven visual token pruning method. Its core innovation is the first formulation of token pruning as a Max-Min Diversity Problem (MMDP) over pairwise distances among visual token embeddings; a greedy algorithm approximates the optimal subset, maximizing representational dissimilarity among retained tokens and thereby reducing redundancy. Unlike conventional importance-score-based pruning, this approach requires no gradient updates or task-specific adaptation. Contribution/Results: The method achieves state-of-the-art accuracy across 16 image- and video-language benchmarks. It enables zero-shot deployment with up to 50% of tokens pruned, without any fine-tuning, yielding significant reductions in end-to-end latency and GPU memory usage.
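The greedy max-min selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `divprune_greedy`, the use of Euclidean distance, and the embedding-matrix shape are all assumptions; DivPrune's actual distance metric and seeding may differ.

```python
import numpy as np

def divprune_greedy(tokens: np.ndarray, k: int) -> list[int]:
    """Greedy approximation of the Max-Min Diversity Problem:
    pick k of n token embeddings so the minimum pairwise distance
    within the selected subset is (approximately) maximized.

    tokens: (n, d) array of visual token embeddings; k >= 2.
    Returns the sorted indices of the retained tokens.
    """
    # Pairwise Euclidean distances between all token embeddings.
    dists = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    # Seed the subset with the two most distant tokens.
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    selected = [int(i), int(j)]
    # For each candidate, its distance to the nearest selected token.
    min_dist = np.minimum(dists[i], dists[j])
    while len(selected) < k:
        # Add the token farthest from the current selection
        # (selected tokens have min_dist 0, so they are never re-picked).
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dists[nxt])
    return sorted(selected)
```

The pruned sequence is then simply `tokens[divprune_greedy(tokens, k)]`; because selection depends only on the embeddings themselves, no importance scores, calibration, or fine-tuning are needed.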

📝 Abstract
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.
Problem

Research questions and friction points this paper is trying to address.

High inference latency in Large Multimodal Models caused by thousands of visual tokens.
Existing pruning methods need fine-tuning or leave redundancy among retained tokens.
Excessive GPU memory usage and end-to-end latency in LMMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates token pruning as Max-Min Diversity Problem
Maximizes diversity among selected visual tokens
Reduces latency and GPU memory usage effectively