🤖 AI Summary
Vision Transformers (ViTs) suffer from high inference latency on mobile devices, yet existing work lacks systematic, empirical latency analysis across diverse architectures and platforms.
Method: We conduct the first large-scale, real-world benchmarking study, evaluating 190 ViT and 102 CNN models across six mobile platforms using TensorFlow Lite and PyTorch Mobile. To address data scarcity, we propose a synthetic modeling approach that generates a diverse latency dataset comprising 1,000 ViT architectures. Leveraging this dataset, we design a generalizable latency prediction model that estimates the inference latency of unseen ViT architectures with low error, meeting practical deployment requirements.
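The summary does not specify the form of the latency prediction model. As a minimal sketch of the general idea, the example below fits a linear regressor over hypothetical architecture descriptors (depth, embedding dimension, token count) to toy measured latencies; the feature set, coefficients, and data are all illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch: feature-based latency prediction for ViT architectures.
# Features, training data, and the linear model form are hypothetical.

def fit_linear(X, y):
    """Least-squares fit: solve the normal equations (X^T X) w = X^T y
    by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def features(arch):
    # Hypothetical descriptors: bias term, depth, embed dim, token count.
    return [1.0, arch["depth"], arch["embed_dim"] / 100.0, arch["tokens"] / 100.0]

# Toy (architecture, measured latency in ms) pairs -- invented numbers.
train = [
    ({"depth": 12, "embed_dim": 384, "tokens": 196}, 48.0),
    ({"depth": 12, "embed_dim": 768, "tokens": 196}, 95.0),
    ({"depth": 24, "embed_dim": 384, "tokens": 196}, 93.0),
    ({"depth": 12, "embed_dim": 384, "tokens": 576}, 130.0),
]
w = fit_linear([features(a) for a, _ in train], [t for _, t in train])

def predict_latency(arch):
    """Predicted latency (ms) for an unseen architecture description."""
    return sum(wi * fi for wi, fi in zip(w, features(arch)))
```

A real predictor would use richer features (attention/MLP block counts, operator-level descriptors) and more expressive models, but the workflow, fitting on measured latencies and querying unseen architectures, is the same.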
Contribution/Results: This work introduces the first large-scale, cross-platform, open-source ViT latency dataset. It identifies the key architectural factors that govern mobile ViT latency and provides a reusable, empirically grounded methodology for efficient model selection and deployment on resource-constrained devices.
📝 Abstract
Given the significant advances in machine learning on mobile devices, particularly in computer vision, we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we build a dataset of measured latencies for 1,000 synthetic ViTs composed of representative building blocks and state-of-the-art architectures, collected with two machine learning frameworks on six mobile platforms. Using this dataset, we show that the inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
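The paper's measurements are taken on real devices through TensorFlow Lite and PyTorch Mobile. As an illustration of the general measurement protocol only (warmup runs to stabilize caches and frequency scaling, then a robust statistic over repeated timings), the sketch below times an arbitrary callable; the helper names and the stand-in workload are assumptions, not the paper's harness.

```python
import statistics
import time

def measure_latency_ms(run_inference, warmup=5, runs=30):
    """Median wall-clock latency of run_inference() in milliseconds.

    Warmup iterations are discarded so that one-time costs (memory
    allocation, caches, DVFS ramp-up) do not skew the measurement;
    the median of the remaining runs resists outliers.
    """
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Stand-in workload for illustration; on a device this would be a
# TFLite Interpreter.invoke() or a PyTorch Mobile forward pass.
def dummy_model():
    sum(i * i for i in range(10_000))
```

Usage: `measure_latency_ms(dummy_model)` returns one latency estimate in milliseconds; repeating it per model and per platform yields the kind of latency table the dataset comprises.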