AI Summary
To address the high computational overhead and memory bottlenecks that hinder the deployment of vision-language models (VLMs) on edge and resource-constrained devices, this paper presents a systematic survey of efficient VLM optimization techniques. We propose the first comprehensive taxonomy covering six key directions: model pruning, quantization, knowledge distillation, lightweight attention mechanisms, multimodal adapters, and hardware-software co-optimization. Over one hundred state-of-the-art works are critically reviewed and empirically analyzed in terms of their latency, GPU memory footprint, and accuracy trade-offs. Furthermore, we establish an actively maintained, open-source GitHub repository to foster standardized, reproducible research. Our contributions include a structured technical roadmap and practical guidelines for deploying efficient multimodal models, bridging the gap between theoretical advances and real-world deployment constraints in edge AI.
Abstract
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, which makes them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications, prompting a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.