AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from visual-token redundancy and low inference efficiency when processing high-resolution, multi-crop inputs. Method: We propose a training-free, dynamic token-pruning method that leverages an adaptive cross-modal attention mixture, jointly modeling visual saliency and text-image similarity, to enable precise, text-aware token selection. The approach operates in the pre-LLM layers of a pre-trained VLM, combining the original text embeddings, saliency maps, and cross-modal attention weights without fine-tuning or additional parameters. Contribution/Results: The method achieves state-of-the-art training-free acceleration, compressing over 90% of visual tokens while preserving competitive question-answering accuracy and significantly improving throughput and latency. It thereby addresses key limitations of purely vision-driven or self-attention-based pruning strategies, which often introduce bias or retain irrelevant tokens.

📝 Abstract
The success of VLMs often relies on dynamic high-resolution schemes that adaptively augment the input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of VLMs. To improve VLM efficiency without introducing extra training costs, many works propose to reduce the visual tokens by filtering out uninformative ones or aggregating their information. Some approaches reduce visual tokens according to the VLM's self-attention, which is biased and can lead to inaccurate responses. Token-reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are non-salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically weighs the effectiveness of visual saliency against text-to-image similarity in the pre-LLM layers to select informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration, especially when the reduction rate is sufficiently large.
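The core idea, selecting visual tokens by adaptively mixing a visual-saliency score with a text-to-image similarity score, can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the mixing rule, the cosine-similarity scoring, and the function name `select_visual_tokens` are illustrative assumptions.

```python
import numpy as np

def select_visual_tokens(visual_feats, text_feats, saliency, keep_ratio=0.1):
    """Illustrative text-aware token selection (not the paper's exact method).

    Scores each visual token by mixing its visual saliency with its best
    cosine similarity to the text embeddings, then keeps the top-k tokens.
    """
    # Normalize features so dot products become cosine similarities.
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # Text-to-image similarity: best-matching text token per visual token.
    sim = (v @ t.T).max(axis=1)
    # Assumed self-adaptive mixing weight: lean on text similarity when it
    # is discriminative (high spread), otherwise fall back to saliency.
    alpha = sim.std() / (sim.std() + saliency.std() + 1e-8)
    scores = alpha * sim + (1.0 - alpha) * saliency
    # Keep only the top-k tokens; a keep_ratio of 0.1 prunes ~90% of them.
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    return np.sort(keep)

# Toy example: 16 visual tokens, 4 text tokens, 64-dim features.
rng = np.random.default_rng(0)
kept = select_visual_tokens(rng.normal(size=(16, 64)),
                            rng.normal(size=(4, 64)),
                            rng.random(16), keep_ratio=0.25)
print(kept)  # indices of the 4 retained visual tokens
```

In a real pipeline these scores would come from the VLM's pre-LLM features and attention maps, and only the retained tokens would be passed on to the language model, which is where the inference savings come from.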
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Efficiency
Feature Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaFV
Cross-modality Attention
Vision-Language Models
Jiayi Han
Inspur Genersoft Co. Ltd., Inspur Group Co. Ltd., Shandong Key Laboratory of Automated Complex Network Software Construction
Liang Du
Associate Professor, Villanova University
electric power systems
Yiwen Wu
Lehigh University
HCI
Xiangguo Zhou
Inspur Genersoft Co. Ltd., Inspur Group Co. Ltd., Shandong Key Laboratory of Automated Complex Network Software Construction
Hongwei Du
Inspur Genersoft Co. Ltd., Inspur Group Co. Ltd., Shandong Key Laboratory of Automated Complex Network Software Construction
Weibo Zheng
Inspur Genersoft Co. Ltd., Inspur Group Co. Ltd., Shandong Key Laboratory of Automated Complex Network Software Construction