A Survey on Efficient Vision-Language Models

📅 2025-04-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high computational overhead and memory bottlenecks hindering the deployment of vision-language models (VLMs) on edge and resource-constrained devices, this paper presents a systematic survey of efficient VLM optimization techniques. We propose the first comprehensive taxonomy covering six key directions: model pruning, quantization, knowledge distillation, lightweight attention mechanisms, multimodal adapters, and hardware-software co-optimization. Over one hundred state-of-the-art works are critically reviewed and empirically analyzed in terms of their latency, GPU memory footprint, and accuracy trade-offs. Furthermore, we establish an actively maintained, open-source GitHub repository to foster standardized, reproducible research. Our contributions include a structured technical roadmap and practical guidelines for deploying efficient multimodal models, bridging the gap between theoretical advances and real-world deployment constraints in edge AI.
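Among the six directions in the taxonomy, quantization is perhaps the easiest to illustrate concretely. The sketch below shows generic symmetric per-tensor int8 post-training quantization of a weight list; it is a minimal illustration of the idea (8-bit integers plus one float scale in place of 32-bit floats), not a method from the surveyed paper, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with a single symmetric scale.

    Generic sketch of post-training quantization: store int8 + one
    float scale instead of float32, cutting weight memory ~4x.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0  # largest magnitude maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and scale."""
    return [x * scale for x in q]

# Example: round-trip a tiny weight vector.
weights = [0.81, -0.43, 0.02, -1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
```

Real VLM quantization schemes covered by such surveys are more involved (per-channel scales, activation calibration, sub-8-bit formats), but they share this core map-to-integers-plus-scale structure.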

๐Ÿ“ Abstract
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
Problem

Research questions and friction points this paper is trying to address.

Optimizing vision-language models for edge devices
Reducing computational demands of VLMs for real-time use
Exploring compact architectures and performance-memory trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive taxonomy of efficient-VLM techniques: pruning, quantization, knowledge distillation, lightweight attention, multimodal adapters, and hardware-software co-optimization
Empirical comparison of 100+ works on latency, GPU memory, and accuracy trade-offs
Actively maintained open-source repository for standardized, reproducible research
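One of the taxonomy's pillars, knowledge distillation, compresses a large teacher VLM into a smaller student by matching softened output distributions. The snippet below is a generic sketch of the classic temperature-scaled distillation loss (in the style of Hinton et al.), not code from the surveyed paper; all names are illustrative.

```python
import math

def softmax(logits, T=1.0):
    """Softmax over logits at temperature T (higher T = softer distribution)."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2.

    The student is trained to reproduce the teacher's softened
    class probabilities; the T^2 factor keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * T * T
```

In practice this term is combined with an ordinary task loss on ground-truth labels, and for VLMs it may be applied to both the vision and language heads.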
Gaurav Shinde
PhD Student, University of Maryland, Baltimore County
Machine Learning, Cyber Physical Systems, Robotics, Computer Vision
Anuradha Ravi
Mobile Pervasive & Sensor Computing Lab and Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, Maryland, 21250, USA
Emon Dey
Mobile Pervasive & Sensor Computing Lab and Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, Maryland, 21250, USA
S. Sakib
Mobile Pervasive & Sensor Computing Lab and Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, Maryland, 21250, USA
Milind Rampure
Mobile Pervasive & Sensor Computing Lab and Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, Maryland, 21250, USA
Nirmalya Roy
Professor, University of Maryland Baltimore County
pervasive computing, mobile computing, sensor networks