Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

📅 2025-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision Transformers (ViTs) are computationally expensive for semantic segmentation, and token pruning tends to cost accuracy in edge regions. To address both problems, this paper proposes a plug-and-play progressive token pruning framework that needs no retraining. The method has two parts. First, a dual-clustering mechanism guided by low-level features combines structural cues (e.g., edges) with high-level semantics. Second, a dynamic weighting strategy based on multi-scale Tsallis entropy scores token importance. Together these avoid the limitations of conventional single-parameter entropy models and, reportedly for the first time, bring edge sensitivity explicitly into token importance assessment. Across multiple benchmarks, the method reduces FLOPs by 20–45% with less than 0.3% mIoU degradation, and it segments edge regions more accurately than existing pruning methods.

📝 Abstract

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, which makes deployment on resource-constrained devices difficult. Existing token pruning methods often overlook fundamental characteristics of visual data. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and by low-level visual features, applied through two rounds of clustering. It combines high-level semantics with basic visual attributes for precise segmentation. A novel dynamic scoring mechanism based on multi-scale Tsallis entropy weighting overcomes the limitations of traditional single-parameter entropy. The framework also analyzes low-level features to preserve critical edge information while reducing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show computational reductions of 20% to 45% with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
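
To make the multi-scale Tsallis entropy idea above concrete, here is a minimal sketch in plain Python. The Tsallis entropy of a distribution p with parameter q is (1 - sum(p_i^q)) / (q - 1), and it reduces to Shannon entropy as q approaches 1. Everything else is an assumption made for illustration: the function names, the choice of q values, the uniform weights (the abstract says the weighting is dynamic but does not give the rule), and the reading of `probs` as one token's normalized attention distribution. It should not be read as the authors' scoring function.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution (natural-log units at q = 1)."""
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1 recovers Shannon entropy.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def multi_scale_score(probs, qs=(0.5, 1.0, 2.0), weights=None):
    """Weighted sum of Tsallis entropies over several q values.

    ASSUMPTION: uniform weights stand in for the paper's dynamic weighting,
    which this page does not specify.
    """
    if weights is None:
        weights = [1.0 / len(qs)] * len(qs)
    return sum(w * tsallis_entropy(probs, q) for w, q in zip(weights, qs))


# Hypothetical usage: a peaked distribution scores lower than a flat one.
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(multi_scale_score(flat) > multi_scale_score(peaked))
```

Using several q values is what makes the score multi-scale: q below 1 gives more weight to low-probability entries, while q above 1 emphasizes the dominant ones, so a single q cannot capture both behaviors.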

Problem

Research questions and friction points this paper is trying to address.

Reduces computation in Vision Transformers for resource-constrained devices

Integrates low-level visual features for precise semantic segmentation

Overcomes limitations of traditional entropy in token pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive token pruning with multi-scale Tsallis entropy

Two-stage clustering guided by low-level visual features

Dynamic scoring mechanism for edge preservation
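
The three points above can be illustrated with a simplified sketch of progressive pruning that protects edge tokens. It is a stand-in, not the paper's algorithm: it skips the clustering step entirely, reuses one fixed score per token across stages (a real model would presumably rescore at each pruning layer), and all names (`prune_stage`, `progressive_prune`, `is_edge`, `keep_ratios`) are hypothetical.

```python
import math


def prune_stage(indices, scores, is_edge, keep_ratio):
    """Keep all edge tokens, then fill the remaining budget by score."""
    budget = max(1, math.ceil(keep_ratio * len(indices)))
    protected = [i for i in indices if is_edge[i]]
    others = sorted(
        (i for i in indices if not is_edge[i]),
        key=lambda i: scores[i],
        reverse=True,
    )
    remaining = max(0, budget - len(protected))
    # Edge tokens are never dropped, even if they exceed the budget.
    return sorted(protected + others[:remaining])


def progressive_prune(scores, is_edge, keep_ratios):
    """Apply prune_stage repeatedly; return the surviving indices per stage."""
    indices = list(range(len(scores)))
    history = []
    for ratio in keep_ratios:
        indices = prune_stage(indices, scores, is_edge, ratio)
        history.append(indices)
    return history


# Hypothetical usage: six tokens, token 1 lies on an edge and has a low score.
scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3]
is_edge = [False, True, False, False, False, False]
print(progressive_prune(scores, is_edge, keep_ratios=[0.5, 0.5]))
# prints [[0, 1, 4], [0, 1]]
```

Token 1 survives both stages despite its low score, which is the behavior the edge-preservation goal asks for; a purely score-based top-k would have dropped it in the first stage.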

🔎 Similar Papers

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference