ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the substantial computational overhead that vision-language models incur when high-resolution images produce large numbers of visual tokens, a problem compounded by existing pruning methods that cannot adapt to varying image complexity. The authors propose ERASE, a two-stage visual token pruning framework that adjusts its pruning strategy to the complexity of the input image, reducing redundancy while retaining salient tokens. On Qwen2.5-VL-7B, at a token pruning ratio of 85%, the method retains 89.46% of the original model's accuracy, whereas the best prior method retains only 78.1%.

📝 Abstract

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
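
To make the pruning ratio concrete, the following is a minimal sketch of generic score-based top-k token pruning at a fixed ratio. It is not the ERASE algorithm: the abstract does not specify how the two stages score tokens or how the strategy adapts to image complexity, so none of that is modeled here. The function names `keep_count` and `prune_top_k` and the saliency scores are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def keep_count(num_tokens: int, pruning_ratio: float) -> int:
    """Number of vision tokens kept at a given pruning ratio (at least one)."""
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    return max(1, round(num_tokens * (1.0 - pruning_ratio)))


def prune_top_k(scores: Sequence[float], pruning_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    `scores` is an assumed per-token saliency value; how such scores
    would be computed is outside the scope of this sketch.
    """
    k = keep_count(len(scores), pruning_ratio)
    # Rank token indices by descending score, then keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:k])


# Hypothetical usage: at an 85% pruning ratio, 1000 vision tokens shrink to 150.
kept_at_85 = keep_count(1000, 0.85)
```

An adaptive method would replace the fixed `pruning_ratio` argument with a value chosen per image; the abstract states that ERASE does this based on image complexity but gives no formula, so the sketch leaves it as a plain parameter.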

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Token Pruning

Computational Overhead

Visual Redundancy

Adaptive Pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Token Pruning

Vision-Language Models

Visual Redundancy Elimination