🤖 AI Summary
To address the inference inefficiency of Large Vision-Language Models (LVLMs) on high-resolution images and long videos, caused by the explosive growth in visual tokens, this paper proposes V²Drop, a dynamic visual token compression method grounded in token variation. We first discover that visual token variation within the LLM is task-agnostic; leveraging this insight, we design a variation-aware token dropping strategy that progressively discards low-variation tokens based on changes in their internal activations. Unlike existing inner-LLM compression methods, V²Drop avoids positional bias and natively supports efficient attention operators such as FlashAttention. Extensive evaluations across multiple LVLMs and benchmarks demonstrate that V²Drop retains 94.0% (image) and 98.6% (video) of original task performance while reducing LLM generation latency by 31.5% and 74.2%, respectively, and significantly lowering peak GPU memory consumption.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V²Drop is able to maintain 94.0% and 98.6% of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V²Drop further reduces GPU peak memory usage.
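The core idea, dropping the visual tokens whose hidden states change least as they pass through LLM layers, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the abstract does not specify the variation metric, so the per-token L2 norm of the difference between consecutive layers' hidden states is an assumption, as are the function name and `keep_ratio` parameter.

```python
import numpy as np

def variation_drop(h_prev, h_curr, keep_ratio=0.5):
    """Rank visual tokens by how much their hidden states changed between
    two consecutive LLM layers, and keep only the most-varying fraction.

    h_prev, h_curr: (num_tokens, hidden_dim) hidden states of the visual
    tokens at the previous and current layer. The L2-norm variation metric
    is an assumption for illustration.
    """
    # Per-token variation: ||h_curr - h_prev||_2 along the feature dimension.
    variation = np.linalg.norm(h_curr - h_prev, axis=-1)   # (num_tokens,)
    k = max(1, int(keep_ratio * h_curr.shape[0]))
    # Indices of the k most-varying tokens, restored to their original order
    # (preserving order avoids introducing positional bias).
    keep = np.sort(np.argsort(variation)[-k:])
    return h_curr[keep], keep

# Toy example: 8 visual tokens with 4-dim features; only the first four
# tokens change between layers, so the last four (zero variation) are dropped.
rng = np.random.default_rng(0)
h_prev = rng.standard_normal((8, 4))
h_curr = h_prev.copy()
h_curr[:4] += rng.standard_normal((4, 4))
kept, idx = variation_drop(h_prev, h_curr, keep_ratio=0.5)
```

Because the surviving tokens form an ordinary dense subsequence (no attention-mask surgery), a step like this composes naturally with fused attention kernels such as FlashAttention, which is the compatibility property the paper emphasizes.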