JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low inference efficiency of Vision Transformers (ViTs) on high-resolution visual tasks. It proposes JetViT, a family of hybrid-attention models produced by Post-Training Attention Search, a framework that converts a pre-trained full-attention ViT into an efficient hybrid by replacing redundant full-attention blocks with linear-attention or window-attention blocks while inheriting the base model's MLP and attention weights. Experiments on DINOv3 and DepthAnythingV2, run on an NVIDIA H100 GPU, report up to 1.79× higher throughput and up to 44.81% lower latency without sacrificing accuracy.
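The two headline figures are each quoted as "up to" values, so they may come from different models or resolutions. As a rough sanity check, the sketch below converts between the two under the assumption (not stated in the source) that throughput is inversely proportional to latency at a fixed batch size. The function names are illustrative only.

```python
def latency_reduction_from_speedup(speedup: float) -> float:
    """Fractional latency reduction implied by a throughput speedup.

    Assumes throughput is inversely proportional to latency (fixed batch size).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_latency_reduction(reduction: float) -> float:
    """Throughput speedup implied by a fractional latency reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(f"{latency_reduction_from_speedup(1.79):.2%}")      # 44.13%
print(f"{speedup_from_latency_reduction(0.4481):.2f}x")   # 1.81x
```

Under that assumption the two numbers are close but not identical, which is consistent with them being best-case results from separate measurements.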
📝 Abstract
We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79× higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
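To make steps (2) and (3) concrete, here is a minimal sketch of one way a per-block replacement search could look. It is a simple greedy loop, not the authors' algorithm, which the abstract does not specify. The function `greedy_attention_search`, the `accuracy_of` callback, the `max_drop` budget, and the toy penalty table are all hypothetical stand-ins for a real evaluation on a validation set.

```python
from typing import Callable, List

# Candidate replacements for a full-attention block (illustrative labels).
BLOCK_TYPES = ("linear", "window")


def greedy_attention_search(
    num_blocks: int,
    accuracy_of: Callable[[List[str]], float],
    max_drop: float,
) -> List[str]:
    """Replace full-attention blocks one at a time, keeping a replacement
    only if accuracy stays within max_drop of the all-full baseline."""
    config = ["full"] * num_blocks
    baseline = accuracy_of(config)
    for i in range(num_blocks):
        best_kind, best_acc = None, None
        for kind in BLOCK_TYPES:
            trial = config.copy()
            trial[i] = kind
            acc = accuracy_of(trial)
            within_budget = baseline - acc <= max_drop
            if within_budget and (best_acc is None or acc > best_acc):
                best_kind, best_acc = kind, acc
        if best_kind is not None:
            config[i] = best_kind  # otherwise the block stays "full"
    return config


# Toy accuracy proxy (made-up numbers): each replacement costs a fixed penalty.
PENALTY = {
    "linear": [0.1, 0.1, 0.9, 0.1],
    "window": [0.2, 0.05, 0.8, 0.3],
}


def toy_accuracy(config: List[str]) -> float:
    return 80.0 - sum(PENALTY[k][i] for i, k in enumerate(config) if k != "full")


print(greedy_attention_search(4, toy_accuracy, max_drop=0.5))
# ['linear', 'window', 'full', 'linear']
```

In this toy run the third block is too sensitive to replace within the budget, so it is preserved as full attention, which mirrors the idea of keeping critical full-attention blocks.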
Problem

Research questions and friction points this paper is trying to address.

Vision Transformer
high-resolution
inference efficiency
attention mechanism
model acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Training Attention Search
Hybrid-Attention Vision Transformer
Linear Attention
Window Attention
High-Resolution Inference
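Why linear and window attention matter at high resolution can be illustrated with rough operation counts. The sketch below uses textbook single-head estimates that ignore projections, softmax, and constant factors. The patch size of 16, the channel width of 768, and the 256-token window are assumed values for illustration, not figures taken from the paper.

```python
def token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image (patch size is an assumed value)."""
    return (height // patch) * (width // patch)


def attention_cost(n_tokens: int, dim: int, kind: str, window_tokens: int = 256) -> int:
    """Rough multiply-accumulate count for one single-head attention layer."""
    if kind == "full":
        # QK^T and (scores)V each cost n * n * dim.
        return 2 * n_tokens * n_tokens * dim
    if kind == "window":
        # Each token attends only to the tokens in its own window.
        return 2 * n_tokens * min(window_tokens, n_tokens) * dim
    if kind == "linear":
        # K^T V and Q(K^T V) each cost n * dim * dim.
        return 2 * n_tokens * dim * dim
    raise ValueError(f"unknown attention kind: {kind}")


for side in (224, 1024, 2048):
    n = token_count(side, side)
    full = attention_cost(n, 768, "full")
    print(
        side,
        n,
        f"window/full={attention_cost(n, 768, 'window') / full:.3f}",
        f"linear/full={attention_cost(n, 768, 'linear') / full:.3f}",
    )
```

Under these assumptions, linear attention is cheaper than full attention only once the token count exceeds the channel width, which is why the savings appear at high resolution rather than at 224×224.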