🤖 AI Summary
To address the inference inefficiency of Large Vision-Language Models (LVLMs) on high-resolution images and long videos, caused by the explosive growth in visual tokens, this paper proposes V²Drop, a dynamic visual token compression method grounded in token variation. We first discover that visual token variation within the LLM is task-agnostic; leveraging this insight, we design a variation-aware token dropping strategy that progressively discards low-variation tokens based on changes in their internal activations. Unlike existing inner-LLM compression methods, V²Drop avoids positional bias and natively supports efficient attention operators such as FlashAttention. Extensive evaluations across multiple LVLMs and benchmarks demonstrate that V²Drop retains 94.0% (image) and 98.6% (video) of original task performance while reducing LLM generation latency by 31.5% and 74.2%, respectively, and significantly lowering peak GPU memory consumption.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V²Drop is able to maintain 94.0% and 98.6% of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V²Drop further reduces GPU peak memory usage.
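The core idea, dropping the visual tokens whose hidden states change least as they pass through LLM layers, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the abstract does not specify the variation metric, so the per-token L2 norm of the difference between consecutive layers' hidden states is an assumption, as are the function name and `keep_ratio` parameter.

```python
import numpy as np

def variation_drop(h_prev, h_curr, keep_ratio=0.5):
    """Rank visual tokens by how much their hidden states changed between
    two consecutive LLM layers, and keep only the most-varying fraction.

    h_prev, h_curr: (num_tokens, hidden_dim) hidden states of the visual
    tokens at the previous and current layer. The L2-norm variation metric
    is an assumption for illustration.
    """
    # Per-token variation: ||h_curr - h_prev||_2 along the feature dimension.
    variation = np.linalg.norm(h_curr - h_prev, axis=-1)   # (num_tokens,)
    k = max(1, int(keep_ratio * h_curr.shape[0]))
    # Indices of the k most-varying tokens, restored to their original order
    # (preserving order avoids introducing positional bias).
    keep = np.sort(np.argsort(variation)[-k:])
    return h_curr[keep], keep

# Toy example: 8 visual tokens with 4-dim features; only the first four
# tokens change between layers, so the last four (zero variation) are dropped.
rng = np.random.default_rng(0)
h_prev = rng.standard_normal((8, 4))
h_curr = h_prev.copy()
h_curr[:4] += rng.standard_normal((4, 4))
kept, idx = variation_drop(h_prev, h_curr, keep_ratio=0.5)
```

Because the surviving tokens form an ordinary dense subsequence (no attention-mask surgery), a step like this composes naturally with fused attention kernels such as FlashAttention, which is the compatibility property the paper emphasizes.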