SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models often suffer performance degradation because visually informative tokens are discarded prematurely in shallow layers, even though those tokens are crucial for subsequent fine-grained reasoning. To address this, the work proposes a training-free cross-layer token bypass mechanism with independent per-layer pruning decisions: when tokens are pruned at specific layers, the unselected tokens are preserved along a bypass path so that their importance can be dynamically re-evaluated in deeper layers. This introduces a novel "bypass" pruning paradigm that mitigates irreversible early information loss and integrates inter-layer importance analysis to enable adaptive token selection. Experiments across multiple vision-language models and benchmarks demonstrate that the proposed method significantly outperforms existing techniques, achieving a superior accuracy-efficiency trade-off while exhibiting more reliable token selection behavior.

📝 Abstract
Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
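The bypass paradigm described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `bypass_prune`, the `importance_fn` callback, and the `keep_ratio` parameter are all hypothetical stand-ins; the key idea it demonstrates is that tokens pruned at one layer are held on a bypass path and re-scored together with the kept tokens at the next pruning stage, rather than being discarded.

```python
def bypass_prune(tokens, importance_fn, pruning_layers, keep_ratio=0.5):
    """Sketch of 'bypass' pruning: prune per layer, but keep unselected
    tokens on a bypass path so later layers can re-evaluate them.

    tokens: list of visual tokens (any representation)
    importance_fn(pool, layer): returns one importance score per token;
        scores may differ across layers (hypothetical interface)
    pruning_layers: layer indices at which pruning decisions are made
    """
    active = list(tokens)  # tokens on the main computation path
    bypass = []            # tokens preserved for re-evaluation

    for layer in pruning_layers:
        # Re-score main-path AND bypassed tokens together, so a token
        # deemed unimportant at a shallow layer can be recovered later.
        pool = active + bypass
        scores = importance_fn(pool, layer)
        ranked = sorted(range(len(pool)), key=lambda i: scores[i],
                        reverse=True)
        k = max(1, int(len(pool) * keep_ratio))
        active = [pool[i] for i in ranked[:k]]   # kept this stage
        bypass = [pool[i] for i in ranked[k:]]   # preserved, not discarded

    return active
```

Under early pruning, a token dropped at the first stage is gone for good; here, because each stage ranks the full pool, a token whose layer-dependent score rises later can re-enter the active set.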
Problem

Research questions and friction points this paper is trying to address.

vision-language models
visual token pruning
fine-grained visual reasoning
premature pruning
cross-layer token importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

token bypass
vision-language models
visual token pruning
layer-wise pruning
training-free inference