MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video pretraining models rely predominantly on high-latency Vision Transformers (ViTs), hindering efficient deployment on resource-constrained mobile devices. Method: This paper introduces MobileViCLIP, the first lightweight video-text model tailored for mobile platforms. Its core innovation is the introduction of temporal structural reparameterization into an efficient image-text model, enabling image-text models to be extended efficiently to the video domain. The approach combines temporal reparameterization, a compact backbone, and large-scale video-text contrastive pretraining. Contribution/Results: MobileViCLIP achieves competitive zero-shot classification and retrieval performance while drastically reducing computational overhead. Specifically, MobileViCLIP-Small runs 55.4× faster than InternVideo2-L14 on mobile devices, matches its zero-shot retrieval accuracy, and outperforms InternVideo2-S14 by 6.9% on MSR-VTT, demonstrating strong accuracy and efficiency for on-device video understanding.
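
The zero-shot classification protocol behind these numbers is the standard CLIP-style one: class names are wrapped in text prompts, embedded once by the text encoder, and each video is assigned to its nearest prompt. A minimal sketch, assuming normalized dual-encoder outputs; the function name, signature, and prompt template are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Assign each video to the class whose prompt embedding is most similar.

    video_emb:      (N, D) video encoder outputs
    class_text_emb: (K, D) text encoder outputs for prompts such as
                    "a video of a person {class_name}" (illustrative template)
    Returns: (N,) predicted class indices.
    """
    v = F.normalize(video_emb, dim=-1)       # unit norm, so dot product == cosine
    c = F.normalize(class_text_emb, dim=-1)
    return (v @ c.T).argmax(dim=-1)          # nearest prompt per video
```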

📝 Abstract
Efficient lightweight neural networks are attracting increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still build on the common high-latency ViT architecture, and few works attempt efficient architectures for mobile devices. This paper bridges that gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset, resulting in an efficient video-text model, termed MobileViCLIP, that runs on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, MobileViCLIP-Small is 55.4× faster than InternVideo2-L14 and 6.7× faster than InternVideo2-S14. In terms of zero-shot retrieval, MobileViCLIP-Small performs on par with InternVideo2-L14 and outperforms InternVideo2-S14 by 6.9% on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
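
The MSR-VTT retrieval figures above are typically reported as Recall@K over cosine similarities between caption and video embeddings. A minimal sketch of text-to-video Recall@1, assuming one paired caption per video (row i of each tensor belongs to the same pair); the helper name is ours, not the paper's:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_to_video_recall_at_1(text_emb: torch.Tensor, video_emb: torch.Tensor) -> float:
    """Caption i counts as a hit iff video i is its highest-scoring match."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sim = t @ v.T                                        # (N, N) cosine similarities
    hits = sim.argmax(dim=-1) == torch.arange(sim.size(0))
    return hits.float().mean().item()
```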
Problem

Research questions and friction points this paper is trying to address.

Develop an efficient video-text model that runs on mobile devices
Improve inference speed while preserving zero-shot performance
Bridge the gap left by the lack of mobile-friendly video pre-trained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight neural network design for on-device efficiency
Temporal structural reparameterization (see the sketch after this list)
Training on a large-scale, high-quality video-text dataset
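
This page does not spell out the module, but structural reparameterization in the RepVGG sense trains a multi-branch block and algebraically folds it into a single operator for deployment. Below is a minimal sketch of one plausible temporal variant, assuming a depthwise Conv1d over the frame axis plus an identity shortcut; TemporalRepBlock and all of its details are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TemporalRepBlock(nn.Module):
    """Identity + depthwise temporal conv at training time; one fused conv at inference."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps the frame count unchanged"
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        nn.init.zeros_(self.conv.weight)  # block starts as an exact identity, so
        nn.init.zeros_(self.conv.bias)    # pretrained image-text weights are preserved
        self.deployed = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) per-token features with the frame axis last
        if self.deployed:
            return self.conv(x)           # single-branch inference path
        return x + self.conv(x)           # multi-branch training path

    @torch.no_grad()
    def reparameterize(self) -> None:
        """Fold the identity shortcut into the conv weights (RepVGG-style fusion)."""
        k = self.conv.kernel_size[0]
        delta = torch.zeros_like(self.conv.weight)  # shape (C, 1, k) since groups == C
        delta[:, 0, k // 2] = 1.0                   # identity == unit impulse at center
        self.conv.weight += delta
        self.deployed = True


# The fused conv reproduces the training-time output exactly:
block = TemporalRepBlock(channels=256)
x = torch.randn(2, 256, 8)                # 2 clips, 256 channels, 8 frames
y = block(x)
block.reparameterize()
assert torch.allclose(y, block(x), atol=1e-6)
```

Because the fusion is exact, the shortcut branch can be dropped at deployment with no accuracy change, which is what makes this style of temporal extension attractive for mobile latency.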
Min Yang
ByteDance
Vision Language Model, Computer Vision, Video Understanding
Zihan Jia
State Key Laboratory for Novel Software Technology, Nanjing University
Zhilin Dai
State Key Laboratory for Novel Software Technology, Nanjing University
Sheng Guo
Ant Group
Computer Vision, Deep Learning, LLM
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab