🤖 AI Summary
Large multimodal models (LMMs) suffer from high inference latency and excessive GPU memory consumption due to redundant visual tokens. Method: This paper proposes a fine-tuning-free, diversity-driven visual token pruning method. Its core innovation is formulating token pruning, for the first time, as a Max-Min Diversity Problem (MMDP) over pairwise distances between visual token embeddings; a greedy algorithm approximates the solution, maximizing representational dissimilarity among the retained tokens and thereby reducing redundancy (a sketch of this greedy selection follows below). Unlike conventional importance-score-based pruning, the approach requires no gradient updates or task-specific adaptation. Contribution/Results: The method achieves state-of-the-art accuracy across 16 vision-language and video-language benchmarks, and it enables zero-shot deployment with up to 50% of visual tokens pruned, without any fine-tuning, yielding significant reductions in end-to-end latency and GPU memory usage.
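Based on the description above, here is a minimal sketch of the greedy max-min selection in PyTorch. The function name `divprune_select`, the cosine-distance metric, and the farthest-pair seeding are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def divprune_select(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedy approximation of Max-Min Diversity selection over visual tokens.

    tokens: (N, D) visual token embeddings.
    keep:   number of tokens to retain (assumed >= 2).
    Returns the indices of the retained tokens, in original order.
    """
    n = tokens.shape[0]
    # Pairwise cosine distances (1 - cosine similarity); the paper's exact
    # distance metric may differ -- this choice is an assumption.
    x = torch.nn.functional.normalize(tokens, dim=-1)
    dist = 1.0 - x @ x.T  # (N, N)

    # Seed with the two tokens that are farthest apart, a common
    # initialization for max-min diversity problems (assumed here).
    flat = torch.argmax(dist).item()
    i, j = divmod(flat, n)
    selected = [i, j]

    # min_dist[k] = distance from token k to its nearest selected token.
    min_dist = torch.minimum(dist[i], dist[j])
    min_dist[selected] = -1.0  # exclude already-selected tokens

    while len(selected) < keep:
        # Greedy step: add the token farthest from the current subset,
        # i.e., the one maximizing the minimum distance to selected tokens.
        nxt = torch.argmax(min_dist).item()
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])
        min_dist[nxt] = -1.0

    return torch.tensor(sorted(selected))
```

Setting `keep = n // 2` corresponds to the 50% pruning setting mentioned above; returning the indices sorted preserves the positional ordering of the retained tokens before they are passed to the LLM.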
📝 Abstract
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. This increased input length significantly raises the complexity of inference for the LLM, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove a portion of the visual tokens, have been proposed. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which increases redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.