ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes ToaSt, a framework for efficient compression of Vision Transformers (ViTs) that addresses the limitations of high computational cost and the need for retraining in existing structured pruning and token compression methods. ToaSt employs a decoupled strategy to compress different ViT modules: it introduces coupled head-level structured pruning in the multi-head self-attention mechanism and adopts Token Channel Selection (TCS) in the feed-forward network. This approach effectively suppresses redundancy and noise while mitigating global propagation issues. The method significantly enhances model robustness and compression efficiency, achieving 88.52% accuracy (+1.64%) on ImageNet with a 39.4% reduction in FLOPs for ViT-MAE-Huge, and attains 52.2 mAP on COCO object detection, outperforming current baselines.
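The summary above describes coupled head-level structured pruning: removing an attention head means dropping its Q/K/V weights and the matching slice of the output projection together. The paper's actual scoring criterion is not given here, so the sketch below is only a hedged illustration of the coupling idea, using assumed weight shapes and a placeholder importance score per head.

```python
import numpy as np

def prune_attention_heads(w_qkv, w_out, head_scores, keep_heads):
    """Illustrative sketch of coupled head-wise structured pruning
    (NOT the paper's exact method; shapes and scoring are assumptions).
    When a head is removed, its Q/K/V blocks AND the matching columns
    of the output projection are dropped together, so the remaining
    weights stay dimensionally consistent.
    Assumed shapes:
      w_qkv: (3, n_heads, head_dim, d_model)  -- stacked Q, K, V
      w_out: (d_model, n_heads * head_dim)    -- output projection
      head_scores: (n_heads,) placeholder importance values
    """
    n_heads, head_dim = w_qkv.shape[1], w_qkv.shape[2]
    # Keep the top-scoring heads, preserving their original order.
    keep = np.sort(np.argsort(head_scores)[-keep_heads:])
    w_qkv_pruned = w_qkv[:, keep]  # drop Q/K/V blocks of removed heads
    # Drop the same heads' columns from the output projection.
    w_out_3d = w_out.reshape(w_out.shape[0], n_heads, head_dim)
    w_out_pruned = w_out_3d[:, keep].reshape(w_out.shape[0],
                                             keep_heads * head_dim)
    return w_qkv_pruned, w_out_pruned

# Tiny example: 4 heads of dim 2 in an 8-dim model, keep the best 2.
rng = np.random.default_rng(0)
w_qkv = rng.normal(size=(3, 4, 2, 8))
w_out = rng.normal(size=(8, 8))
scores = np.array([0.1, 0.9, 0.4, 0.7])  # hypothetical head importances
qkv_p, out_p = prune_attention_heads(w_qkv, w_out, scores, keep_heads=2)
```

Because both sides of the head are removed in one step, the pruned attention module can be run directly without retraining-time masking, which is the robustness argument the summary gestures at.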

📝 Abstract
Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60\% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52\% accuracy (+1.64\%) with a 39.4\% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
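The abstract's Token Channel Selection operates on the FFN, which dominates ViT FLOPs, and avoids the global propagation of token pruning by acting locally on channels. The paper's selection criterion is not specified here, so the following is a minimal sketch under an assumed criterion (mean absolute activation per channel), meant only to illustrate the channel-selection shape of the operation.

```python
import numpy as np

def token_channel_selection(hidden, keep_ratio=0.6):
    """Hedged sketch of Token Channel Selection (assumed criterion,
    not the paper's exact one): score each hidden channel of an FFN
    activation by its mean absolute value over tokens, then keep only
    the top-scoring channels. The selection is local to this layer,
    so no global token-level changes propagate to later blocks.
    hidden: (n_tokens, n_channels) FFN hidden activations
    """
    n_tokens, n_channels = hidden.shape
    k = max(1, int(round(n_channels * keep_ratio)))
    scores = np.abs(hidden).mean(axis=0)      # per-channel importance
    keep = np.sort(np.argsort(scores)[-k:])   # kept channel indices, ordered
    return hidden[:, keep], keep

# Example: 8 tokens with 10 hidden channels, keep half the channels.
rng = np.random.default_rng(1)
h = rng.normal(size=(8, 10))
pruned, kept = token_channel_selection(h, keep_ratio=0.5)
```

In a real FFN the kept indices would also slice the second linear layer's input weights, mirroring the coupled pruning used for attention heads.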
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
computational cost
structured pruning
token compression
optimization challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured pruning
token channel selection
Vision Transformer
efficient ViT
decoupled compression
Hyunchan Moon
LG Electronics, Seoul, Republic of Korea
Cheonjun Park
Hankuk University of Foreign Studies
Efficient AI · Pruning · Quantization · NPU · LLM
Steven L. Waslander
University of Toronto, Toronto, Canada