Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models

📅 2024-09-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the high computational overhead in multimodal large language models (MM-LLMs) caused by concatenating visual and textual tokens, this paper proposes a dynamic two-stage token pruning method. In the first stage, leveraging the long-tailed distribution of CLS token similarities, we introduce a novel inflection-point-driven dynamic visual token pruning strategy. In the second stage, cross-modal correlation modeling is employed to guide adaptive, layer-wise textual token sparsification within the LLM. The method reduces total token count to 22% of the original while preserving model accuracy, yielding substantial inference speedup. Our core contributions lie in the joint design of inflection-point identification in long-tailed similarity distributions and cross-modal collaborative pruning, achieving an optimal trade-off between computational efficiency and task performance.
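The first stage described above hinges on finding the inflection point of the sorted CLS-similarity curve. The paper does not publish its detection rule here, so the sketch below is a minimal, hypothetical reconstruction: it scores visual tokens by cosine similarity to the CLS token, sorts them into the long-tailed curve, and locates the knee as the point farthest below the chord joining the curve's endpoints. The function name `prune_by_inflection` and the chord-distance heuristic are assumptions, not the authors' exact method.

```python
import numpy as np

def prune_by_inflection(visual_tokens: np.ndarray, cls_token: np.ndarray):
    """Hypothetical stage-1 pruning: keep visual tokens above the knee
    of the sorted CLS-similarity curve."""
    # Cosine similarity of each visual token to the CLS token.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=-1, keepdims=True)
    c = cls_token / np.linalg.norm(cls_token)
    sims = v @ c                                  # shape (N,)

    order = np.argsort(sims)[::-1]                # descending -> long-tail curve
    curve = sims[order]

    # One simple knee heuristic: the point farthest below the straight
    # line (chord) joining the curve's endpoints. A long-tail curve drops
    # steeply then flattens, so it lies below this chord.
    n = len(curve)
    x = np.arange(n, dtype=float)
    chord = curve[0] + (curve[-1] - curve[0]) * x / (n - 1)
    knee = int(np.argmax(chord - curve))

    keep = order[: max(knee, 1)]                  # retain the head of the curve
    return visual_tokens[keep], keep
```

With synthetic tokens where a handful are near-duplicates of the CLS token, the knee lands just after that high-similarity head, so most of the random tail is pruned.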

📝 Abstract
Recently, multimodal large language models (MM-LLMs) have achieved significant success in various tasks, but their high computational costs limit widespread application. The main computational burden arises from processing concatenated text and visual tokens in the LLM layer, where input token length directly affects efficiency. Our analysis of visual tokens reveals that their similarity to the CLS token follows a long-tail distribution, with only a few showing high similarity. To address this, we propose a dynamic pruning algorithm that identifies the inflection point in the visual-CLS token similarity curve, enabling effective trimming of visual tokens to accelerate inference. Additionally, we perform a second round of pruning in the LLM layer, filtering out low-correlation tokens through the interaction between visual and textual features. Experimental results demonstrate that our method achieves performance comparable to the original while utilizing only 22% of the original token quantity. Our source code will be made publicly available upon acceptance.
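The second pruning round filters text tokens by their interaction with visual features. The abstract does not specify the scoring rule, so the following is a minimal sketch under stated assumptions: each text token is scored by its mean cosine correlation with the retained visual tokens, and only the top fraction is kept. The function name `filter_text_tokens` and the `keep_ratio` parameter are illustrative, not from the paper.

```python
import numpy as np

def filter_text_tokens(text_tokens: np.ndarray,
                       visual_tokens: np.ndarray,
                       keep_ratio: float = 0.5):
    """Hypothetical stage-2 sparsification: keep the text tokens most
    correlated with the (already pruned) visual tokens."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=-1, keepdims=True)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=-1, keepdims=True)
    corr = t @ v.T                         # (T, V) cross-modal similarities
    scores = corr.mean(axis=1)             # per-text-token correlation score

    k = max(1, int(round(keep_ratio * len(text_tokens))))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # top-k, original order
    return text_tokens[keep], keep
```

In the paper this step is applied adaptively and layer-wise inside the LLM; a fixed `keep_ratio` is used here only to keep the sketch self-contained.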
Problem

Research questions and friction points this paper is trying to address.

Reduce computational cost in MM-LLMs
Optimize token representation dynamically
Prune low-correlation visual and textual tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic pruning algorithm optimizes token representation
Leverages long-tail distribution for efficiency
Reduces token count to 22% of the original
Gaotong Yu
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Yi Chen
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Jian Xu
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China