Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

📅 2026-05-21
🤖 AI Summary
This work addresses the high inference cost of pretrained Vision Transformer (ViT) backbones, which limits their deployment on resource-constrained devices. The authors exploit the convolution-like behavior of some attention heads: simple strategies identify which heads can be replaced, and a depthwise convolution-based layer serves as a drop-in replacement for them. A fine-tuning procedure then recovers downstream task performance. On image classification and segmentation tasks, the approach achieves a 17–20% inference speedup with minimal performance degradation while preserving the backbone's feature extraction capabilities.
📝 Abstract
Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
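The abstract does not spell out the replacement layer, so the following is only a toy sketch of the underlying intuition, not the paper's implementation. It assumes a simplified attention head whose weights depend only on the relative offset between patch tokens inside a k x k window, with no softmax renormalization, no class token, no value projection, and zero padding at the borders. Under those assumptions the head produces the same output as a depthwise convolution that uses one shared kernel for every channel. The names `depthwise_conv2d`, `positional_attention_head`, and `max_abs_diff` are hypothetical helpers introduced for illustration.

```python
# Toy illustration (assumptions: fixed offset-only attention weights,
# no softmax, no class token, zero padding). Not the paper's layer.

def depthwise_conv2d(x, kernels):
    """Depthwise convolution, stride 1, zero padding.

    x: H x W x C nested lists; kernels: C kernels, each k x k (k odd).
    Each channel is filtered independently with its own kernel.
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    r = len(kernels[0]) // 2
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[ch][di + r][dj + r] * x[ii][jj][ch]
                out[i][j][ch] = acc
    return out


def positional_attention_head(x, offset_weights):
    """Simplified head: attention weight depends only on token offset.

    Builds the full N x N attention matrix over the H*W patch tokens,
    then applies it to the values x (treated as already projected).
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    r = len(offset_weights) // 2
    tokens = [(i, j) for i in range(h) for j in range(w)]
    n = len(tokens)
    attn = [[0.0] * n for _ in range(n)]
    for q, (qi, qj) in enumerate(tokens):
        for t, (ti, tj) in enumerate(tokens):
            di, dj = ti - qi, tj - qj
            if abs(di) <= r and abs(dj) <= r:
                attn[q][t] = offset_weights[di + r][dj + r]
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for q, (qi, qj) in enumerate(tokens):
        for t, (ti, tj) in enumerate(tokens):
            a = attn[q][t]
            if a != 0.0:
                for ch in range(c):
                    out[qi][qj][ch] += a * x[ti][tj][ch]
    return out


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two H x W x C maps."""
    return max(
        abs(va - vb)
        for row_a, row_b in zip(a, b)
        for px_a, px_b in zip(row_a, row_b)
        for va, vb in zip(px_a, px_b)
    )


# Usage: a 4 x 4 grid of patch tokens with 2 channels.
grid = [[[float(i * 4 + j + ch) for ch in range(2)] for j in range(4)]
        for i in range(4)]
weights = [[0.0, 0.1, 0.0], [0.2, 0.4, 0.2], [0.0, 0.1, 0.0]]
head_out = positional_attention_head(grid, weights)
conv_out = depthwise_conv2d(grid, [weights, weights])  # same kernel per channel
print("max difference:", max_abs_diff(head_out, conv_out))
```

The attention version touches all N x N token pairs, while the convolution touches only k x k neighbors per token, which is the intuition behind replacing convolution-like heads with a cheaper layer. A real depthwise layer is more general than this head, since it can learn a different kernel for each channel.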
Problem

Research questions and friction points this paper is trying to address.

Vision Foundation Models
Vision Transformer
Inference Efficiency
Resource-Constrained Devices
Model Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depthwise Convolution
Vision Transformer Acceleration
Attention Head Replacement
Drop-in Module
Efficient Inference