MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video pretraining models rely predominantly on high-latency Vision Transformers (ViTs), hindering efficient deployment on resource-constrained mobile devices. Method: This paper introduces MobileViCLIP, the first lightweight video-text model tailored for mobile platforms. Its core innovation is the introduction of temporal structural reparameterization into an efficient image-text model, enabling image-text models to be extended efficiently to the video domain. The approach combines temporal reparameterization, a compact backbone, and large-scale video-text contrastive pretraining. Contribution/Results: MobileViCLIP achieves competitive zero-shot classification and retrieval performance while drastically reducing computational overhead. Specifically, MobileViCLIP-Small runs 55.4× faster than InternVideo2-L14 on mobile devices, matches its zero-shot retrieval accuracy, and outperforms InternVideo2-S14 by 6.9% on MSR-VTT, demonstrating strong accuracy and efficiency for on-device video understanding.
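
The zero-shot classification protocol behind these numbers is the standard CLIP-style one: class names are wrapped in text prompts, embedded once by the text encoder, and each video is assigned to its nearest prompt. A minimal sketch, assuming normalized dual-encoder outputs; the function name, signature, and prompt template are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Assign each video to the class whose prompt embedding is most similar.

    video_emb:      (N, D) video encoder outputs
    class_text_emb: (K, D) text encoder outputs for prompts such as
                    "a video of a person {class_name}" (illustrative template)
    Returns: (N,) predicted class indices.
    """
    v = F.normalize(video_emb, dim=-1)       # unit norm, so dot product == cosine
    c = F.normalize(class_text_emb, dim=-1)
    return (v @ c.T).argmax(dim=-1)          # nearest prompt per video
```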

📝 Abstract
Efficient lightweight neural networks are attracting increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still build on the common high-latency ViT architecture, and few works attempt efficient architectures for mobile devices. This paper bridges that gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset, resulting in an efficient video-text model, termed MobileViCLIP, that runs on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, MobileViCLIP-Small is 55.4× faster than InternVideo2-L14 and 6.7× faster than InternVideo2-S14. In terms of zero-shot retrieval, MobileViCLIP-Small performs on par with InternVideo2-L14 and outperforms InternVideo2-S14 by 6.9% on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
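
The MSR-VTT retrieval figures above are typically reported as Recall@K over cosine similarities between caption and video embeddings. A minimal sketch of text-to-video Recall@1, assuming one paired caption per video (row i of each tensor belongs to the same pair); the helper name is ours, not the paper's:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_to_video_recall_at_1(text_emb: torch.Tensor, video_emb: torch.Tensor) -> float:
    """Caption i counts as a hit iff video i is its highest-scoring match."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sim = t @ v.T                                        # (N, N) cosine similarities
    hits = sim.argmax(dim=-1) == torch.arange(sim.size(0))
    return hits.float().mean().item()
```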
Problem

Research questions and friction points this paper is trying to address.

Develop an efficient video-text model that runs on mobile devices
Improve inference speed while preserving zero-shot performance
Bridge the gap left by the lack of mobile-friendly video pre-trained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight neural network design for on-device efficiency
Temporal structural reparameterization (see the sketch after this list)
Training on a large-scale, high-quality video-text dataset
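
This page does not spell out the module, but structural reparameterization in the RepVGG sense trains a multi-branch block and algebraically folds it into a single operator for deployment. Below is a minimal sketch of one plausible temporal variant, assuming a depthwise Conv1d over the frame axis plus an identity shortcut; TemporalRepBlock and all of its details are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TemporalRepBlock(nn.Module):
    """Identity + depthwise temporal conv at training time; one fused conv at inference."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps the frame count unchanged"
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        nn.init.zeros_(self.conv.weight)  # block starts as an exact identity, so
        nn.init.zeros_(self.conv.bias)    # pretrained image-text weights are preserved
        self.deployed = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) per-token features with the frame axis last
        if self.deployed:
            return self.conv(x)           # single-branch inference path
        return x + self.conv(x)           # multi-branch training path

    @torch.no_grad()
    def reparameterize(self) -> None:
        """Fold the identity shortcut into the conv weights (RepVGG-style fusion)."""
        k = self.conv.kernel_size[0]
        delta = torch.zeros_like(self.conv.weight)  # shape (C, 1, k) since groups == C
        delta[:, 0, k // 2] = 1.0                   # identity == unit impulse at center
        self.conv.weight += delta
        self.deployed = True


# The fused conv reproduces the training-time output exactly:
block = TemporalRepBlock(channels=256)
x = torch.randn(2, 256, 8)                # 2 clips, 256 channels, 8 frames
y = block(x)
block.reparameterize()
assert torch.allclose(y, block(x), atol=1e-6)
```

Because the fusion is exact, the shortcut branch can be dropped at deployment with no accuracy change, which is what makes this style of temporal extension attractive for mobile latency.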
Min Yang
ByteDance
Vision Language Model, Computer Vision, Video Understanding
Zihan Jia
State Key Laboratory for Novel Software Technology, Nanjing University
Zhilin Dai
State Key Laboratory for Novel Software Technology, Nanjing University
Sheng Guo
Ant Group
Computer Vision, Deep Learning, LLM
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab