Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video understanding models suffer from excessive computational cost, large parameter counts, and high inference latency, hindering real-time deployment on mobile devices. To address this, the authors propose Mobile-VideoGPT, a lightweight vision-language framework for video understanding. The method introduces two key innovations: (1) an attention-based key-frame scoring mechanism that identifies semantically salient frames, and (2) a redundancy-aware visual token pruning projector that eliminates superfluous visual tokens before fusion while preserving essential contextual cues. These are integrated with dual lightweight visual encoders and a small language model (0.5B parameters). Mobile-VideoGPT significantly reduces computational overhead: compared with existing state-of-the-art 0.5B models, it uses 40% fewer parameters, delivers more than 2× higher throughput (up to 46 tokens/sec), and outperforms them by an average of 6 points across six standard video understanding benchmarks.
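The key-frame scoring idea described above can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration, not the paper's released implementation: the function name `score_frames`, the query vector, and the scaled dot-product scoring are assumptions; the actual mechanism operates on learned attention maps inside the model.

```python
import numpy as np

def score_frames(frame_features: np.ndarray, query: np.ndarray, k: int = 4) -> np.ndarray:
    """Score each frame by attention against a query vector and return the
    indices of the top-k most salient frames, in temporal order.

    frame_features: (num_frames, dim) per-frame feature vectors
    query:          (dim,) query/context vector (hypothetical stand-in)
    """
    # Scaled dot-product attention logits over frames
    logits = frame_features @ query / np.sqrt(frame_features.shape[-1])
    attn = np.exp(logits - logits.max())
    attn = attn / attn.sum()                 # softmax over frames
    top_k = np.argsort(attn)[::-1][:k]       # k highest-attention frames
    return np.sort(top_k)                    # restore temporal order
```

Selecting only the top-k frames bounds the number of visual tokens the language model must process, which is where the latency savings come from.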

📝 Abstract
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2× higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
Problem

Research questions and friction points this paper is trying to address.

Reduces high computational requirements in video understanding models
Addresses slow inference speed for practical video analysis
Minimizes parameter count while maintaining accuracy in video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight dual visual encoders for efficiency
Attention-Based Frame Scoring for key-frame selection
Token projector pruning redundant visual tokens
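The redundant-token pruning listed above can be illustrated with a greedy similarity filter. This is a sketch under stated assumptions: the paper's projector is a learned module, whereas `prune_redundant_tokens`, the cosine-similarity criterion, and the `threshold` parameter here are hypothetical, rule-based stand-ins.

```python
import numpy as np

def prune_redundant_tokens(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily drop visual tokens that are near-duplicates of tokens
    already kept, measured by cosine similarity.

    tokens: (num_tokens, dim) visual token embeddings
    Returns the surviving tokens, preserving their original order.
    """
    # Unit-normalize so dot products equal cosine similarities
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]                               # always keep the first token
    for i in range(1, len(tokens)):
        sims = normed[kept] @ normed[i]      # similarity to kept tokens
        if sims.max() < threshold:           # keep only if sufficiently novel
            kept.append(i)
    return tokens[kept]
```

Because consecutive video frames often produce highly similar patch tokens, even a simple filter like this removes a large fraction of tokens before they reach the language model; the learned projector achieves the same goal while also preserving contextual cues.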