🤖 AI Summary
To address the low inference efficiency of multimodal large language models (MLLMs) caused by redundant visual tokens, this paper proposes VISA: a method that constructs a semantic similarity–based visual token graph, employs graph neural networks to model inter-token relationships, and introduces a group-wise selection and aggregation mechanism dynamically guided by deep textual tokens to identify salient visual tokens. Semantic information from pruned tokens is preserved via graph message passing into retained nodes, enabling high-fidelity visual token compression. Evaluated on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, VISA significantly outperforms existing pruning approaches—maintaining or even improving downstream task performance while accelerating inference by up to 2.1×. This work establishes a novel paradigm for efficient MLLM deployment through structured, semantics-aware visual token compression.
📝 Abstract
In this study, we introduce a novel method called group-wise **VI**sual token **S**election and **A**ggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method preserves more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens over this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) that divides visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
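The core VTA idea described above — building a cosine-similarity graph over visual tokens and passing messages from removed tokens into kept ones — can be sketched as follows. This is a minimal illustration under assumed design choices (softmax-normalized similarity edges, a hypothetical temperature `tau`, and an externally supplied `keep_idx`); the paper's actual module and selection strategy may differ.

```python
import numpy as np

def aggregate_visual_tokens(tokens, keep_idx, tau=0.1):
    """Sketch of similarity-graph token aggregation (assumed form of VTA).

    tokens:   (N, d) array of visual token features
    keep_idx: indices of tokens to retain (in the real method, chosen
              by the text-guided group-wise selection strategy)
    Returns a (K, d) array of compressed token representations.
    """
    # Build the semantic-similarity graph via cosine similarity.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T  # (N, N) adjacency over visual tokens

    removed_idx = [i for i in range(len(tokens)) if i not in set(keep_idx)]
    out = tokens[keep_idx].copy()
    if not removed_idx:
        return out

    # Edge weights from each removed node to the kept nodes,
    # softmax-normalized with an assumed temperature tau.
    w = sim[np.ix_(removed_idx, keep_idx)] / tau
    w = np.exp(w - w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # (R, K)

    # Message passing: fold removed-token features into kept tokens.
    out += w.T @ tokens[removed_idx]

    # Renormalize each kept token by the total mass it received
    # (its own unit weight plus incoming edge weights).
    mass = 1.0 + w.sum(axis=0, keepdims=True).T  # (K, 1)
    return out / mass
```

In this sketch, pruned tokens are not discarded outright: each kept token becomes a weighted average of itself and its most semantically similar removed neighbors, which is the sense in which the compression is described as high-fidelity.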