Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high inference cost of pretrained Vision Transformers (ViTs), which hinders their deployment on resource-constrained devices. The authors exploit the convolution-like behavior of some attention heads: they identify which heads can be replaced and swap them for drop-in depthwise convolution layers. A fine-tuning procedure then recovers performance on downstream tasks. On image classification and segmentation tasks, the approach achieves a 17-20% inference speedup with minimal performance degradation while preserving the backbone's feature extraction capabilities.
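The summary does not say how replaceable heads are identified. As a purely illustrative sketch, one plausible proxy is a locality score: the fraction of each query's attention mass that falls inside a small window around it on the patch grid. The names `locality_score` and `select_replaceable_heads`, the window radius, and the 0.9 threshold are all assumptions made for this example, not the authors' criterion.

```python
def locality_score(attn, height, width, radius=1):
    """Mean fraction of each query's attention mass inside its local window.

    attn is an n x n list of lists (n = height * width), with tokens in
    row-major grid order. Hypothetical proxy, not taken from the paper.
    """
    n = height * width
    total = 0.0
    for i in range(n):
        y, x = divmod(i, width)
        local = sum(attn[i][j] for j in range(n)
                    if abs(j // width - y) <= radius
                    and abs(j % width - x) <= radius)
        total += local / sum(attn[i])
    return total / n


def select_replaceable_heads(attn_maps, height, width, threshold=0.9):
    """Return indices of heads whose score meets the (assumed) threshold."""
    return [h for h, attn in enumerate(attn_maps)
            if locality_score(attn, height, width) >= threshold]


# Toy 3 x 3 patch grid with two synthetic heads.
height = width = 3
n = height * width
global_head = [[1.0 / n] * n for _ in range(n)]           # attends everywhere
local_head = [[1.0 if i == j else 0.0 for j in range(n)]  # attends to itself
              for i in range(n)]

scores = [locality_score(a, height, width) for a in (global_head, local_head)]
selected = select_replaceable_heads([global_head, local_head], height, width)
print(scores, selected)
```

In this toy setup only the second head is selected, since the uniform head spreads much of its mass outside the 3 x 3 window. A real pipeline would average attention maps over many images before scoring.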

📝 Abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
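To make the "convolution-like" intuition concrete, the sketch below checks a simplified case: an idealized head whose attention weight depends only on the key's offset from the query produces the same output as a depthwise convolution that uses that weight pattern as the kernel for every channel. This is a toy illustration under stated assumptions (no softmax renormalization at the borders, zero padding, a 3 x 3 window), not the paper's layer or its derivation. All function names are hypothetical.

```python
from itertools import product


def neighbors(y, x, height, width, radius):
    """Yield (dy, dx, token_index) for in-bounds grid neighbors of (y, x)."""
    for dy, dx in product(range(-radius, radius + 1), repeat=2):
        yy, xx = y + dy, x + dx
        if 0 <= yy < height and 0 <= xx < width:
            yield dy, dx, yy * width + xx


def relative_attention_head(values, offset_weights, height, width):
    """Idealized head: weight depends only on the key's offset from the query."""
    n, channels = height * width, len(values[0])
    radius = len(offset_weights) // 2
    attn = [[0.0] * n for _ in range(n)]  # explicit n x n attention matrix
    for i in range(n):
        y, x = divmod(i, width)
        for dy, dx, j in neighbors(y, x, height, width, radius):
            attn[i][j] = offset_weights[dy + radius][dx + radius]
    return [[sum(attn[i][j] * values[j][c] for j in range(n))
             for c in range(channels)] for i in range(n)]


def depthwise_conv(values, kernels, height, width):
    """Zero-padded depthwise filter (cross-correlation, as in deep learning
    frameworks): kernels[c] is applied to channel c only."""
    n, channels = height * width, len(values[0])
    radius = len(kernels[0]) // 2
    out = [[0.0] * channels for _ in range(n)]
    for i in range(n):
        y, x = divmod(i, width)
        for dy, dx, j in neighbors(y, x, height, width, radius):
            for c in range(channels):
                out[i][c] += kernels[c][dy + radius][dx + radius] * values[j][c]
    return out


# Toy value vectors on a 3 x 4 patch grid with 2 channels.
height, width, channels = 3, 4, 2
values = [[0.1 * (i + 1) * (c + 1) for c in range(channels)]
          for i in range(height * width)]
offset_weights = [[0.05, 0.10, 0.05],
                  [0.10, 0.40, 0.10],
                  [0.05, 0.10, 0.05]]

head_out = relative_attention_head(values, offset_weights, height, width)
conv_out = depthwise_conv(values, [offset_weights] * channels, height, width)
max_diff = max(abs(a - b) for ra, rb in zip(head_out, conv_out)
               for a, b in zip(ra, rb))
print("max difference:", max_diff)
```

The attention path builds an n x n matrix, while the convolution touches only a fixed window per token, which is where the cost saving comes from. In a framework such as PyTorch, a depthwise convolution over C channels is usually written as `torch.nn.Conv2d(C, C, k, padding=k // 2, groups=C)`.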

Problem

Research questions and friction points this paper is trying to address.

Vision Foundation Models

Vision Transformer

Inference Efficiency

Resource-Constrained Devices

Model Acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Depthwise Convolution

Vision Transformer Acceleration

Attention Head Replacement