🤖 AI Summary
To address the low inference efficiency of diffusion vision-language models (DVLMs) caused by excessive visual tokens, this paper proposes a training-free, response-driven visual token pruning method. It dynamically evaluates the importance of each visual token via attention from the masked response tokens and, exploiting the observation that these importance scores stay consistent across denoising steps, prunes the less important visual tokens after the first step. This work pioneers response-driven pruning for DVLMs, offering three key advantages: no retraining required, high accuracy preservation, and dynamic inference optimization. Experiments on LLaDA-V and LaViDa demonstrate significant improvements: generation throughput increases by up to 186% and 28.05%, respectively, while inference latency decreases by up to 64.97% and 21.87%. Notably, accuracy is maintained, and in some cases slightly improved, despite the substantial computational savings.
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across denoising steps, RedVTP prunes the less important visual tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves the token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising accuracy, and in some cases even improving it.
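The core idea, ranking visual tokens by the attention they receive from the masked response tokens and keeping only the top fraction after the first denoising step, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the keep ratio, and the attention-matrix shape are assumptions.

```python
import numpy as np

def prune_visual_tokens(attn, keep_ratio=0.5):
    """Hypothetical sketch of response-driven pruning.

    attn: array of shape (num_response_tokens, num_visual_tokens),
          attention weights from masked response tokens to visual tokens,
          taken from the first denoising step.
    Returns the indices of the visual tokens to keep, in original order.
    """
    # Importance of each visual token = mean attention it receives
    # from the masked response tokens.
    importance = attn.mean(axis=0)
    k = max(1, int(keep_ratio * importance.shape[0]))
    # Keep the k highest-scoring visual tokens; sorting the indices
    # preserves their original positional order in the sequence.
    keep = np.sort(np.argsort(importance)[-k:])
    return keep

# Toy example: 4 masked response tokens attending over 6 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 6))
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # indices of the 3 retained visual tokens
```

Because the importance scores are observed to be stable across steps, this selection is computed once after step one and reused for all remaining denoising steps, which is where the throughput and latency gains come from.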