🤖 AI Summary
To address the low inference efficiency of diffusion vision-language models (DVLMs) caused by excessive visual tokens, this paper proposes a training-free, response-driven visual token pruning method. It dynamically evaluates the importance of each visual token via attention from the masked response tokens and, exploiting the observation that these importance scores stay consistent across denoising steps, prunes the less important visual tokens after the first step. This work pioneers response-driven pruning for DVLMs, offering three key advantages: no retraining required, high accuracy preservation, and dynamic inference optimization. Experiments on LLaDA-V and LaViDa demonstrate significant improvements: generation throughput increases by up to 186% and 28.05%, respectively, while inference latency decreases by up to 64.97% and 21.87%. Notably, accuracy is maintained, and in some cases slightly improved, despite the substantial computational savings.
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across denoising steps, RedVTP prunes the less important visual tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves the token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising accuracy, and in some cases even improving it.
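The core idea, ranking visual tokens by the attention they receive from the masked response tokens and keeping only the top fraction after the first denoising step, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the keep ratio, and the attention-matrix shape are assumptions.

```python
import numpy as np

def prune_visual_tokens(attn, keep_ratio=0.5):
    """Hypothetical sketch of response-driven pruning.

    attn: array of shape (num_response_tokens, num_visual_tokens),
          attention weights from masked response tokens to visual tokens,
          taken from the first denoising step.
    Returns the indices of the visual tokens to keep, in original order.
    """
    # Importance of each visual token = mean attention it receives
    # from the masked response tokens.
    importance = attn.mean(axis=0)
    k = max(1, int(keep_ratio * importance.shape[0]))
    # Keep the k highest-scoring visual tokens; sorting the indices
    # preserves their original positional order in the sequence.
    keep = np.sort(np.argsort(importance)[-k:])
    return keep

# Toy example: 4 masked response tokens attending over 6 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 6))
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # indices of the 3 retained visual tokens
```

Because the importance scores are observed to be stable across steps, this selection is computed once after step one and reused for all remaining denoising steps, which is where the throughput and latency gains come from.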