Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

📅 2024-09-16
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address the high computational overhead of multimodal large language models (MLLMs) caused by redundant visual tokens, this paper proposes a training-free, budget-driven visual token pruning method. The approach formulates pruning as a statistical optimization problem that minimizes the divergence between the attention distributions before and after pruning (via KL-divergence minimization), guided by attention statistics collected from a small batch of inference data. It combines dynamic importance scoring with greedy selection to produce a generalizable pruning recipe prior to inference. Crucially, the method requires no fine-tuning or gradient computation, and the complete recipe can be generated in about 5 minutes. Evaluated on LLaVA-series models, it achieves up to a 54.9% FLOPs reduction with only a 0.5% accuracy drop, significantly improving inference efficiency and deployment feasibility.
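The core idea in the summary above — choosing which visual tokens to drop so that the pruned attention distribution stays as close as possible (in KL divergence) to the original one — can be illustrated with a minimal, self-contained sketch. This is not the paper's actual recipe (which derives per-layer pruning ratios from attention statistics over a calibration batch); the function names and the 1-D attention-mass input are illustrative assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def greedy_prune(attn, keep):
    """Greedily drop visual tokens until only `keep` remain.

    attn : attention mass received by each visual token (e.g., averaged over
           heads and queries from a small calibration batch).
    keep : number of tokens to retain.

    At each step, the removal that least increases the KL divergence between
    the original and the renormalized pruned attention distribution is taken.
    Returns the sorted indices of retained tokens and the final divergence.
    """
    total = sum(attn)
    full = [a / total for a in attn]          # normalized original distribution
    kept = list(range(len(attn)))
    while len(kept) > keep:
        best_idx, best_kl = None, float("inf")
        for i in kept:
            trial = [j for j in kept if j != i]
            mass = sum(full[j] for j in trial)
            # Renormalize the surviving tokens; removed tokens get zero mass.
            q = [full[j] / mass if j in trial else 0.0 for j in range(len(full))]
            kl = kl_divergence(full, q)
            if kl < best_kl:
                best_idx, best_kl = i, kl
        kept.remove(best_idx)
    mass = sum(full[j] for j in kept)
    q = [full[j] / mass if j in kept else 0.0 for j in range(len(full))]
    return sorted(kept), kl_divergence(full, q)
```

Under this objective, the greedy step naturally discards the lowest-attention tokens first, which matches the intuition that redundant visual tokens receive little attention mass; the actual method generalizes this to a full pruning recipe across layers under a FLOPs budget.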

📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) often relies on large numbers of image tokens to compensate for the visual shortcomings of MLLMs, which not only exhibit obvious redundancy but also greatly exacerbate the already high computational cost. Token pruning is an effective way to speed up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel, training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune casts token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence of the attention distributions before and after pruning. In practice, FitPrune can be accomplished quickly based on the attention statistics from a small batch of inference data, avoiding expensive trial runs of the MLLM. Following the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can greatly reduce computational complexity while retaining high performance, e.g., a 54.9% FLOPs reduction for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Token Pruning
Efficiency and Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

FitPrune
Multi-modal Model Pruning
Training-free Pruning
Weihao Ye
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.
Qiong Wu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.
Wenhao Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.
Yiyi Zhou
Xiamen University
deep learning, language and vision