Empirical Recipes for Efficient and Compact Vision-Language Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency and insufficient throughput of existing compact vision-language models (VLMs) in resource-constrained environments. Through end-to-end efficiency profiling, the study identifies key inference bottlenecks and proposes a general system-level optimization framework encompassing performance analysis, inference acceleration, and structured, perception-aware output design, compatible with diverse VLM architectures and deployment frameworks. The resulting ArgusVLM family achieves substantial efficiency gains without compromising accuracy: it reduces first-token generation latency by 53% on InternVL3-2B and by 93% on SmolVLM-256M, while demonstrating strong performance across multiple benchmarks.

📝 Abstract
Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
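The abstract's headline metric, time to first token (TTFT), is the delay between submitting a request and receiving the first generated token; it is dominated by vision encoding and prefill rather than decoding. Below is a minimal, hedged sketch of how TTFT can be measured against any streaming generation interface. The `measure_ttft` helper and the `fake_stream` stand-in (with its prefill/decode delays) are illustrative assumptions, not part of the paper's tooling.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds for a token stream."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            # First token arrived: everything before this point is "prefill" cost
            # (image encoding + prompt processing in a real VLM).
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total


def fake_stream(n_tokens: int = 5,
                prefill_s: float = 0.05,
                decode_s: float = 0.005) -> Iterator[str]:
    # Stand-in for a real VLM serving endpoint: one long delay before the
    # first token, then faster per-token decoding. Delays are illustrative.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(decode_s)
```

Wrapping a real serving framework's streaming API (e.g. an iterator of decoded tokens) in `measure_ttft` gives the same numbers the paper reports percentage reductions against; the key design point is that TTFT is measured at the first yield, not at stream completion.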
Problem

Research questions and friction points this paper is trying to address.

vision-language models
inference efficiency
compact models
latency
resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
model efficiency
inference optimization
compact models
structured perception