AI Summary
To address the high computational overhead and memory bottlenecks that hinder the deployment of vision-language models (VLMs) on edge and resource-constrained devices, this paper presents a systematic survey of efficient VLM optimization techniques. We propose the first comprehensive taxonomy covering six key directions: model pruning, quantization, knowledge distillation, lightweight attention mechanisms, multimodal adapters, and hardware-software co-optimization. Over one hundred state-of-the-art works are critically reviewed and empirically analyzed in terms of their latency, GPU memory footprint, and accuracy trade-offs. Furthermore, we establish an actively maintained, open-source GitHub repository to foster standardized, reproducible research. Our contributions include a structured technical roadmap and practical guidelines for deploying efficient multimodal models, bridging the gap between theoretical advances and real-world deployment constraints in edge AI.
Abstract
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, which makes them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications, prompting a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.