🤖 AI Summary
Existing vision foundation models support only discrete architectural sizes, limiting adaptability to diverse computational budgets. To address this, we propose SnapViT—a training-free, label-free elasticization method for Vision Transformers (ViTs). SnapViT leverages gradient information and cross-layer structural correlations, approximating the off-diagonal Hessian structure via evolutionary optimization to enable self-supervised importance scoring. It then performs structured pruning to achieve continuous sparsity control. The method is broadly applicable to diverse pre-trained ViTs—including DINO, SigLIPv2, DeiT, and AugReg—and requires less than five minutes on a single A100 GPU to generate high-performance, scalable models. Critically, it operates without fine-tuning or annotated data, and consistently outperforms state-of-the-art pruning approaches across multiple sparsity levels.
📝 Abstract
Vision foundation models achieve remarkable performance but are available only in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structural correlations, approximated via an evolutionary algorithm; it requires no labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeiT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
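To make the general recipe concrete, here is a minimal, illustrative sketch (not the authors' implementation): each prunable structure, e.g. an attention head, receives a diagonal importance score from its squared gradients, and a continuous sparsity knob selects how many low-scoring structures to drop. The function names and toy gradients below are our own assumptions; SnapViT additionally refines such scores with an evolutionary approximation of cross-layer (Hessian off-diagonal) correlations, which this sketch omits.

```python
# Illustrative sketch of diagonal, gradient-based structured pruning.
# Names and data are hypothetical; SnapViT's evolutionary cross-layer
# correlation estimate is omitted for brevity.

def importance_scores(structure_grads):
    """Diagonal (Fisher-style) score per structure: sum of squared
    gradients, computable from a self-supervised loss (no labels)."""
    return [sum(g * g for g in grads) for grads in structure_grads]

def prune_mask(scores, sparsity):
    """Keep the highest-scoring structures; drop a `sparsity` fraction.
    Because sparsity is continuous, one scoring pass yields a whole
    family of models at different compute budgets."""
    n_drop = round(sparsity * len(scores))
    order = sorted(range(len(scores)), key=scores.__getitem__)  # ascending
    mask = [True] * len(scores)
    for idx in order[:n_drop]:
        mask[idx] = False  # prune the least important structures
    return mask

# Toy example: four "heads" with hand-made gradient vectors.
grads = [[0.1, 0.2], [2.0, 1.5], [0.05, 0.1], [1.0, 0.5]]
scores = importance_scores(grads)        # ≈ [0.05, 6.25, 0.0125, 1.25]
mask = prune_mask(scores, sparsity=0.5)  # keep the top 2 heads
```

Re-running `prune_mask` with a different `sparsity` re-slices the same scores instantly, which is what makes the resulting model "elastic": no retraining is needed to move along the accuracy/compute trade-off.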