SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models often suffer performance degradation because visually informative tokens are discarded prematurely in shallow layers, even though those tokens are crucial for subsequent fine-grained reasoning. To address this, the work proposes a training-free cross-layer token bypass mechanism with independent per-layer pruning decisions: when tokens are pruned at specific layers, the unselected tokens are preserved along a bypass path so that their importance can be dynamically re-evaluated in deeper layers. This introduces a novel "bypass" pruning paradigm that mitigates irreversible early information loss and integrates inter-layer importance analysis to enable adaptive token selection. Experiments across multiple vision-language models and benchmarks demonstrate that the proposed method significantly outperforms existing techniques, achieving a superior accuracy-efficiency trade-off while exhibiting more reliable token selection behavior.

📝 Abstract
Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
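The bypass paradigm described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `bypass_prune`, the `importance_fn` callback, and the `keep_ratio` parameter are all hypothetical stand-ins; the key idea it demonstrates is that tokens pruned at one layer are held on a bypass path and re-scored together with the kept tokens at the next pruning stage, rather than being discarded.

```python
def bypass_prune(tokens, importance_fn, pruning_layers, keep_ratio=0.5):
    """Sketch of 'bypass' pruning: prune per layer, but keep unselected
    tokens on a bypass path so later layers can re-evaluate them.

    tokens: list of visual tokens (any representation)
    importance_fn(pool, layer): returns one importance score per token;
        scores may differ across layers (hypothetical interface)
    pruning_layers: layer indices at which pruning decisions are made
    """
    active = list(tokens)  # tokens on the main computation path
    bypass = []            # tokens preserved for re-evaluation

    for layer in pruning_layers:
        # Re-score main-path AND bypassed tokens together, so a token
        # deemed unimportant at a shallow layer can be recovered later.
        pool = active + bypass
        scores = importance_fn(pool, layer)
        ranked = sorted(range(len(pool)), key=lambda i: scores[i],
                        reverse=True)
        k = max(1, int(len(pool) * keep_ratio))
        active = [pool[i] for i in ranked[:k]]   # kept this stage
        bypass = [pool[i] for i in ranked[k:]]   # preserved, not discarded

    return active
```

Under early pruning, a token dropped at the first stage is gone for good; here, because each stage ranks the full pool, a token whose layer-dependent score rises later can re-enter the active set.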
Problem

Research questions and friction points this paper is trying to address.

vision-language models
visual token pruning
fine-grained visual reasoning
premature pruning
cross-layer token importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

token bypass
vision-language models
visual token pruning
layer-wise pruning
training-free inference