HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 1
📄 PDF
🤖 AI Summary
To address the low inference efficiency and high GPU memory consumption of high-resolution vision-language models (VLMs) on resource-constrained hardware, this paper proposes HiRED, a plug-and-play early visual token pruning method. Its core innovation is the first use of the ViT CLS token's attention maps for region-level token importance estimation, enabling dynamic, adaptive allocation of a fixed token budget for content-aware, fine-grained visual token removal. HiRED requires no fine-tuning or retraining and is compatible with mainstream VLM architectures such as LLaVA-Next. Experiments on an NVIDIA Tesla P40 show that HiRED-20% achieves a 4.7× throughput improvement, a 78% latency reduction, and a 14% memory saving per single inference; at batch size 4, GPU memory usage drops by 30%, preventing out-of-memory errors while preserving task accuracy.

πŸ“ Abstract
High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. Processing such a large number of visual tokens through multiple transformer networks poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of the CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget for each partition accordingly. The most informative visual tokens from each partition within the allocated budget are then selected and passed to the subsequent Large Language Model (LLM). We show that HiRED achieves superior accuracy and performance compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits. Code - https://github.com/hasanar1f/HiRED
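The two-stage idea in the abstract (allocate the fixed budget across partitions by CLS-attention mass, then keep the highest-attention tokens within each partition) can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' implementation: the function name, the proportional-allocation rule, and the use of summed attention as a partition score are assumptions for clarity.

```python
import numpy as np

def hired_style_selection(cls_attn_per_partition, total_budget):
    """Sketch of attention-guided budget allocation and token selection.

    cls_attn_per_partition: list of 1-D arrays; each holds the ViT CLS-token
        attention over the patch tokens of one image partition.
    total_budget: fixed number of visual tokens to keep across all partitions.
    Returns: list of index arrays, the kept token indices per partition.
    """
    # Score each partition by its total CLS-attention mass (a simple
    # proxy for how much visual content the partition carries).
    scores = np.array([attn.sum() for attn in cls_attn_per_partition])
    weights = scores / scores.sum()

    # Allocate the fixed budget proportionally to partition importance,
    # assigning any rounding remainder to the most important partition.
    budgets = np.floor(weights * total_budget).astype(int)
    budgets[np.argmax(weights)] += total_budget - budgets.sum()

    # Within each partition, keep the tokens with the highest CLS attention.
    kept = []
    for attn, k in zip(cls_attn_per_partition, budgets):
        k = min(k, attn.size)
        kept.append(np.argsort(attn)[::-1][:k])
    return kept
```

In a real pipeline the selected indices would gather the corresponding visual token embeddings before they are passed to the LLM; the early dropping means all subsequent transformer layers process only the reduced token set.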
Problem

Research questions and friction points this paper is trying to address.

Visual Language Models
High-resolution Images
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-guided Information Selection
High-resolution Image Text Modeling
Memory-efficient Processing