Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address computational constraints, poor energy efficiency, and insufficient real-time performance in deploying vision-language models (VLMs) on mobile devices, this work conducts a systematic end-to-end performance evaluation of three leading inference frameworks—llama.cpp, MLC-LLM, and mllm—on state-of-the-art VLMs (LLaVA, MobileVLM, and Imp) using the OnePlus 13R as the hardware platform. We develop a full-stack benchmarking toolkit measuring CPU/GPU/NPU utilization, power consumption, thermal behavior, and end-to-end latency. Our analysis uncovers a previously unreported cross-stage hardware-resource mismatch: GPU saturation during image encoding, severe CPU bottlenecks during text generation, and low, highly volatile NPU utilization. The core contribution is a hardware-aware, VLM-specific analytical methodology for mobile deployment, complemented by an open-source lightweight monitoring tool. This work provides empirically grounded insights and concrete optimization directions for efficient on-device VLM inference.

📝 Abstract
Vision-Language Models (VLMs) offer promising capabilities for mobile devices, but their deployment faces significant challenges due to computational limitations and energy inefficiency, especially for real-time applications. This study provides a comprehensive survey of deployment frameworks for VLMs on mobile devices, evaluating llama.cpp, MLC-Imp, and mllm while running LLaVA-1.5 7B, MobileVLM-3B, and Imp-v1.5 3B as representative workloads on a OnePlus 13R. Measurements covered CPU, GPU, and NPU utilization, temperature, inference time, power consumption, and user experience. Benchmarking revealed critical performance bottlenecks across frameworks: CPU resources were consistently over-utilized during token generation, while the GPU and NPU accelerators were largely idle. When the GPU was used, primarily for image feature extraction, it became saturated, degrading device responsiveness. The study contributes framework-level benchmarks, practical profiling tools, and an in-depth analysis of hardware utilization bottlenecks, highlighting the consistent overuse of CPUs and the ineffective or unstable use of GPUs and NPUs in current deployment frameworks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing VLM deployment on mobile devices for efficiency
Addressing CPU overuse and GPU/NPU underutilization in VLMs
Evaluating performance bottlenecks in mobile VLM frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated llama.cpp, MLC-Imp, mllm for VLM deployment
Profiled CPU, GPU, NPU utilization on OnePlus 13R
Identified CPU overuse, GPU/NPU inefficiency bottlenecks
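The profiling described above relies on sampling per-subsystem counters exposed by the Android/Linux kernel. The paper's own monitoring tool is not reproduced here; as a hypothetical sketch (function names and the sysfs thermal path are assumptions, not the authors' code), CPU utilization can be derived from the delta between two `/proc/stat` samples, and device temperature read from a thermal zone:

```python
def parse_cpu_times(stat_line: str) -> tuple[int, int]:
    """Parse the aggregate 'cpu' line from /proc/stat into (idle, total) jiffies.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    """
    fields = [int(v) for v in stat_line.split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait count as not-busy
    return idle, sum(fields)


def cpu_utilization(prev_line: str, curr_line: str) -> float:
    """Fraction of time (0..1) the CPUs were busy between two samples."""
    idle0, total0 = parse_cpu_times(prev_line)
    idle1, total1 = parse_cpu_times(curr_line)
    busy = (total1 - total0) - (idle1 - idle0)
    return busy / max(total1 - total0, 1)


def read_temp_celsius(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read one thermal zone (millidegrees C in sysfs) as degrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0
```

GPU and NPU utilization are vendor-specific (e.g. Adreno and Hexagon counters on Snapdragon SoCs) and typically require vendor sysfs nodes or profiling APIs rather than a uniform interface, which is one reason cross-accelerator monitoring on mobile is harder than CPU monitoring.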
Pablo Robin Guerrero
École Polytechnique Fédérale de Lausanne (EPFL)
Yueyang Pan
École Polytechnique Fédérale de Lausanne (EPFL)
Sanidhya Kashyap
Assistant Professor, EPFL
Systems