🤖 AI Summary
To address the high computational overhead and memory bottlenecks that hinder Vision Transformer (ViT) deployment on edge devices, this paper presents a systematic survey of lightweight-model and acceleration techniques tailored to edge scenarios—spanning model compression (e.g., pruning, quantization, knowledge distillation, attention simplification), software optimization (e.g., compiler frameworks such as TVM), and hardware adaptation (e.g., GPU/TPU/FPGA mapping). Its key contributions are: (1) the first unified taxonomy for ViT edge deployment, explicitly characterizing trade-offs among accuracy, latency, power consumption, and hardware platforms; (2) a structured evaluation framework covering 120+ works that identifies real-world deployment bottlenecks; and (3) a reproducible, cross-platform technique-selection guide to advance co-optimization of accuracy, latency, and power efficiency.
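Among the compression techniques the summary lists, quantization is the most direct to illustrate. The sketch below shows a generic symmetric per-tensor int8 post-training quantization of a weight matrix in NumPy; it is a textbook variant for illustration only, not the scheme of any specific paper covered by the survey.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single per-tensor scale (symmetric)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# A toy ViT-sized weight matrix (768 x 768, as in ViT-Base MLP blocks).
rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)                    # 4x memory reduction (fp32 -> int8)
print(float(np.abs(w - w_hat).max()) <= scale) # rounding error bounded by one step
```

The 4x memory saving is the easy win; the accuracy/latency trade-off the survey catalogs comes from how such quantized weights interact with attention layers and the target hardware's integer arithmetic units.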
📝 Abstract
In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges that gap by providing a structured analysis of model compression techniques, software tools for edge inference, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.
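The patch-sequence view described above is also where the computational bottleneck comes from: self-attention cost grows quadratically with the number of patches. The NumPy sketch below is a minimal illustration (not any specific ViT implementation): it splits a 224x224 image into 16x16 patches and runs one randomly initialized single-head attention layer over the resulting token sequence; the patch size, head dimension, and weight initialization are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch=16):
    """Flatten an H x W x C image into (H/patch * W/patch) patch vectors."""
    h, w, c = image.shape
    p = image.reshape(h // patch, patch, w // patch, patch, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def self_attention(x, d=64, seed=0):
    """One attention head over n tokens; the (n, n) score matrix is the
    quadratic-memory term that dominates on high-resolution inputs."""
    rng = np.random.default_rng(seed)
    n, dim = x.shape
    wq, wk, wv = (0.02 * rng.standard_normal((dim, d)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                 # shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)            # 14 * 14 = 196 tokens of dimension 768
out = self_attention(tokens)
print(tokens.shape, out.shape)    # (196, 768) (196, 64)
```

Doubling the input resolution quadruples the token count and grows the attention matrix sixteenfold, which is why the attention-simplification and hardware-mapping techniques surveyed here matter for edge deployment.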