🤖 AI Summary
To address the high computational cost and degraded edge-region accuracy of Vision Transformers (ViTs) in semantic segmentation, this paper proposes a plug-and-play, retraining-free progressive token pruning framework. The method has two main components. First, a dual-clustering mechanism guided by low-level features combines structural cues (e.g., edges) with high-level semantics. Second, token importance is scored with a dynamic weighting strategy based on multi-scale Tsallis entropy. According to the paper, this overcomes the limitations of conventional single-parameter entropy models and is the first approach to explicitly incorporate edge sensitivity into token importance assessment. Evaluated on multiple benchmarks, the method reduces FLOPs by 20–45% with less than 0.3% mIoU degradation, and it achieves notably better edge-region segmentation accuracy than existing pruning methods.
📝 Abstract
Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, which makes deployment on resource-constrained devices challenging. Existing token pruning methods often overlook fundamental characteristics of visual data. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, applied through two rounds of clustering. It integrates high-level semantics with basic visual attributes for precise segmentation. A novel dynamic scoring mechanism based on multi-scale Tsallis entropy weighting overcomes the limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while reducing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% reductions in computation with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
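To make the idea of multi-scale Tsallis entropy scoring concrete, the sketch below computes the standard Tsallis entropy S_q(p) = (1 - Σ p_i^q) / (q - 1), which reduces to Shannon entropy as q approaches 1, at several values of q and combines the results into one score per token. Everything beyond that definition is an assumption made for illustration and is not taken from the paper: the function names `multiscale_score` and `keep_indices`, the choice of q values, the uniform weights (the paper describes dynamic weighting without giving details here), and the rule that higher-entropy tokens are the ones kept.

```python
import math

def tsallis_entropy(probs, q):
    """Standard Tsallis entropy of a discrete distribution; Shannon entropy at q = 1."""
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)

def multiscale_score(probs, qs=(0.5, 1.0, 2.0), weights=None):
    """Hypothetical token score: weighted sum of Tsallis entropies over several q.

    Uniform weights are an assumption; the paper's dynamic weights are not given here.
    """
    if weights is None:
        weights = [1.0 / len(qs)] * len(qs)
    return sum(w * tsallis_entropy(probs, q) for w, q in zip(weights, qs))

def keep_indices(token_probs, keep_ratio=0.7):
    """Return indices of tokens to keep, in their original order.

    Assumption: high-entropy (uncertain) tokens are kept, confident ones are pruned.
    """
    scores = [multiscale_score(p) for p in token_probs]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Three tokens with per-class probabilities: certain, maximally uncertain, fairly confident.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
kept = keep_indices(tokens, keep_ratio=0.67)
```

With these inputs, the fully confident first token scores zero at every q and is the one dropped. Using several q values rather than one matters because small q emphasizes low-probability classes while large q emphasizes the dominant class, which is one plausible reading of why a single-parameter entropy is described as limiting.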