🤖 AI Summary
This work addresses the challenge of excessive inference latency when running vision transformers (ViTs) for delay-sensitive mobile video analytics, where high-resolution inputs hinder real-time performance. To overcome this, we propose ViTMAlis, a device-to-edge collaborative offloading framework tailored for dense prediction tasks with ViTs. ViTMAlis introduces, for the first time, a dynamic mixed-resolution inference mechanism that adaptively adjusts input resolution based on network conditions and video content, jointly optimizing communication and computation latency. This approach breaks away from conventional CNN-based offloading paradigms and enables a flexible runtime trade-off between accuracy and latency. Experimental results on commodity mobile and edge devices demonstrate that ViTMAlis significantly reduces end-to-end latency while improving user-perceived rendering accuracy.
📝 Abstract
Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs), leveraging their superior global context modeling and generalization capabilities. However, deploying these advanced models in latency-critical MVA scenarios presents significant challenges. Unlike traditional CNN-based offloading paradigms, where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks, where the need for high-resolution inputs exacerbates the inherent quadratic computational complexity of ViTs. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored for ViT-backboned dense prediction models, enabling flexible runtime trade-offs between speed and accuracy. Building on this, we introduce ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delays. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices. Extensive experiments demonstrate that, compared with state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy, providing a practical foundation for next-generation mobile intelligence.