Elastic ViTs from Pretrained Models without Retraining

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision foundation models support only discrete architectural sizes, limiting adaptability to diverse computational budgets. To address this, we propose SnapViT—a training-free, label-free elasticization method for Vision Transformers (ViTs). SnapViT leverages gradient information and cross-layer structural correlations to approximate the off-diagonal Hessian structure via evolutionary optimization, enabling self-supervised importance scoring. It then performs structured pruning to achieve continuous sparsity control. The method is broadly applicable to diverse pre-trained ViTs—including DINO, SigLIP-v2, DeiT, and AugReg—requiring less than five minutes on a single A100 GPU to generate high-performance, scalable models. Critically, it operates without fine-tuning or annotated data, and consistently outperforms state-of-the-art pruning approaches across multiple sparsity levels.
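The summary describes two steps: scoring structural units (e.g., attention heads) from gradient information without labels, then pruning the lowest-scoring units to hit a target sparsity. A minimal sketch of that idea, using a first-order |gradient × weight| saliency summed per head — an illustrative stand-in, not SnapViT's exact scoring rule:

```python
import numpy as np

def head_importance(weights, grads, num_heads):
    """First-order saliency |g * w| aggregated per attention head.

    `weights`, `grads`: (num_heads * head_dim, d_model) arrays from a
    self-supervised backward pass. The per-head grouping and the |g*w|
    criterion are assumptions for illustration, not the paper's method.
    """
    w = weights.reshape(num_heads, -1)
    g = grads.reshape(num_heads, -1)
    return np.abs(w * g).sum(axis=1)

def prune_heads(scores, sparsity):
    """Keep the highest-scoring heads for a given sparsity level."""
    n_keep = max(1, int(round(len(scores) * (1.0 - sparsity))))
    keep = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep)
```

Because the scores are computed once, the same ranking supports any sparsity level: sliding `sparsity` up or down re-selects heads without recomputing gradients, which is what makes the resulting model elastic.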

📝 Abstract
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structural correlations, approximated via an evolutionary algorithm; it requires no labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeiT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
Problem

Research questions and friction points this paper is trying to address.

Enabling elastic inference across computational budgets for Vision Transformers
Pruning pretrained models without retraining or labeled data requirements
Generating adjustable models that maintain performance across various sparsities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-pretraining pruning for elastic Vision Transformers
Evolutionary algorithm approximates Hessian off-diagonal structures
Self-supervised importance scoring without retraining or labels
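The page does not detail how the evolutionary algorithm approximates cross-layer (Hessian off-diagonal) interactions, but the general pattern — searching over per-layer sparsity allocations under a global budget, scored by a cheap fitness proxy — can be sketched as follows. The fitness function, mutation scheme, and `evolve_allocation` helper here are all illustrative assumptions:

```python
import numpy as np

def evolve_allocation(importance, budget, pop=16, gens=30, rng=None):
    """Toy evolutionary search over per-layer keep-ratios.

    importance: list of per-layer arrays of unit importance scores.
    budget: global fraction of units to keep (0..1).
    Fitness (total retained importance) is a stand-in proxy, not the
    paper's Hessian-aware objective.
    """
    rng = np.random.default_rng(rng)
    n_layers = len(importance)
    total = sum(len(s) for s in importance)

    def fitness(ratios):
        # Importance retained when each layer keeps its top-k units.
        score = 0.0
        for s, r in zip(importance, ratios):
            k = int(round(len(s) * r))
            if k:
                score += np.sort(s)[::-1][:k].sum()
        return score

    def project(ratios):
        # Rescale so the global keep fraction roughly matches the budget.
        kept = sum(len(s) * r for s, r in zip(importance, ratios))
        scale = budget * total / max(kept, 1e-9)
        return np.clip(ratios * scale, 0.0, 1.0)

    population = [project(rng.uniform(0.1, 1.0, n_layers)) for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop // 2]
        children = [project(p + rng.normal(0, 0.05, n_layers)) for p in parents]
        population = parents + children
    return max(population, key=fitness)
```

Allocating sparsity non-uniformly across layers is what a purely diagonal (per-unit) importance score cannot do on its own; a population-based search like this one lets layers trade budget against each other, which is the role the paper assigns to its evolutionary approximation.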