Empirical Recipes for Efficient and Compact Vision-Language Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency and insufficient throughput of existing compact vision-language models (VLMs) in resource-constrained environments. Through end-to-end efficiency profiling, the study identifies key inference bottlenecks and proposes a general system-level optimization framework encompassing performance analysis, inference acceleration, and structured, perception-aware output design, compatible with diverse VLM architectures and deployment frameworks. The resulting ArgusVLM family achieves substantial efficiency gains without compromising accuracy: it reduces first-token generation latency by 53% on InternVL3-2B and by 93% on SmolVLM-256M, while demonstrating strong performance across multiple benchmarks.

📝 Abstract
Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
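The abstract's headline metric, time to first token (TTFT), is the delay between submitting a request and receiving the first generated token; it is dominated by vision encoding and prefill rather than decoding. Below is a minimal, hedged sketch of how TTFT can be measured against any streaming generation interface. The `measure_ttft` helper and the `fake_stream` stand-in (with its prefill/decode delays) are illustrative assumptions, not part of the paper's tooling.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds for a token stream."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            # First token arrived: everything before this point is "prefill" cost
            # (image encoding + prompt processing in a real VLM).
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total


def fake_stream(n_tokens: int = 5,
                prefill_s: float = 0.05,
                decode_s: float = 0.005) -> Iterator[str]:
    # Stand-in for a real VLM serving endpoint: one long delay before the
    # first token, then faster per-token decoding. Delays are illustrative.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(decode_s)
```

Wrapping a real serving framework's streaming API (e.g. an iterator of decoded tokens) in `measure_ttft` gives the same numbers the paper reports percentage reductions against; the key design point is that TTFT is measured at the first yield, not at stream completion.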
Problem

Research questions and friction points this paper is trying to address.

vision-language models
inference efficiency
compact models
latency
resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
model efficiency
inference optimization
compact models
structured perception