🤖 AI Summary
Vision Transformers (ViTs) suffer from high inference latency on mobile devices, yet existing work lacks systematic, empirical latency analysis across diverse architectures and platforms.
Method: We conduct the first large-scale, real-world benchmarking study, evaluating 190 ViT and 102 CNN models across six mobile platforms using TensorFlow Lite and PyTorch Mobile. To address data scarcity, we propose a synthetic modeling approach that generates a diverse latency dataset comprising 1,000 ViT architectures. Leveraging this dataset, we design a generalizable latency prediction model that estimates the inference latency of unseen ViT architectures with low error, meeting practical deployment requirements.
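The summary does not specify the form of the latency prediction model. As a minimal sketch of the general idea, the example below fits a linear regressor over hypothetical architecture descriptors (depth, embedding dimension, token count) to toy measured latencies; the feature set, coefficients, and data are all illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch: feature-based latency prediction for ViT architectures.
# Features, training data, and the linear model form are hypothetical.

def fit_linear(X, y):
    """Least-squares fit: solve the normal equations (X^T X) w = X^T y
    by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def features(arch):
    # Hypothetical descriptors: bias term, depth, embed dim, token count.
    return [1.0, arch["depth"], arch["embed_dim"] / 100.0, arch["tokens"] / 100.0]

# Toy (architecture, measured latency in ms) pairs -- invented numbers.
train = [
    ({"depth": 12, "embed_dim": 384, "tokens": 196}, 48.0),
    ({"depth": 12, "embed_dim": 768, "tokens": 196}, 95.0),
    ({"depth": 24, "embed_dim": 384, "tokens": 196}, 93.0),
    ({"depth": 12, "embed_dim": 384, "tokens": 576}, 130.0),
]
w = fit_linear([features(a) for a, _ in train], [t for _, t in train])

def predict_latency(arch):
    """Predicted latency (ms) for an unseen architecture description."""
    return sum(wi * fi for wi, fi in zip(w, features(arch)))
```

A real predictor would use richer features (attention/MLP block counts, operator-level descriptors) and more expressive models, but the workflow, fitting on measured latencies and querying unseen architectures, is the same.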
Contribution/Results: This work introduces the first large-scale, cross-platform, open-source ViT latency dataset. It identifies the key architectural factors that govern mobile ViT latency and provides a reusable, empirically grounded methodology for efficient model selection and deployment on resource-constrained devices.
📝 Abstract
Given the significant advances in machine learning on mobile devices, particularly in computer vision, we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we build a dataset of measured latencies for 1,000 synthetic ViTs composed of representative building blocks and state-of-the-art architectures, collected with two machine learning frameworks on six mobile platforms. Using this dataset, we show that the inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
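The paper's measurements are taken on real devices through TensorFlow Lite and PyTorch Mobile. As an illustration of the general measurement protocol only (warmup runs to stabilize caches and frequency scaling, then a robust statistic over repeated timings), the sketch below times an arbitrary callable; the helper names and the stand-in workload are assumptions, not the paper's harness.

```python
import statistics
import time

def measure_latency_ms(run_inference, warmup=5, runs=30):
    """Median wall-clock latency of run_inference() in milliseconds.

    Warmup iterations are discarded so that one-time costs (memory
    allocation, caches, DVFS ramp-up) do not skew the measurement;
    the median of the remaining runs resists outliers.
    """
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Stand-in workload for illustration; on a device this would be a
# TFLite Interpreter.invoke() or a PyTorch Mobile forward pass.
def dummy_model():
    sum(i * i for i in range(10_000))
```

Usage: `measure_latency_ms(dummy_model)` returns one latency estimate in milliseconds; repeating it per model and per platform yields the kind of latency table the dataset comprises.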