🤖 AI Summary
This work addresses the high deployment cost and latency of large video-language models (VLMs), as well as the difficulty lightweight local models face in balancing accuracy with real-time performance. To this end, the authors propose a local-first, on-demand edge-augmented collaborative inference framework that enhances efficiency without compromising accuracy. Key innovations include shared visual representations, accelerated video tokenization, a query-adaptive edge augmentation mechanism, and latency-aware dynamic control of visual token density. Experimental results demonstrate that the proposed approach achieves accuracy comparable to large VLMs across multiple video understanding benchmarks while reducing response latency by up to 12.8×.
📝 Abstract
Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8× reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
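To make the control ideas in the abstract concrete, here is a minimal sketch of how a local-first, delay-aware controller of this kind *could* look. The paper does not publish this code; every name, threshold, and signature below is a hypothetical illustration of the general pattern (pick a vision-token density that fits the delay budget, and escalate to the edge model only when the local model seems unreliable), not QuickGrasp's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    token_density: float  # fraction of vision tokens kept (0..1)
    use_edge: bool        # escalate this query to the edge-hosted VLM?


def plan_query(local_confidence: float,
               delay_budget_ms: float,
               est_local_ms: float,
               est_edge_ms: float,
               conf_threshold: float = 0.8) -> QueryPlan:
    """Hypothetical controller combining two ideas from the abstract:
    delay-aware token-density configuration and query-adaptive
    edge augmentation (local-first, edge on demand)."""
    # Delay-aware density: spend more vision tokens when there is slack
    # between the delay budget and the estimated local inference time.
    slack_ms = max(delay_budget_ms - est_local_ms, 0.0)
    density = min(1.0, 0.25 + slack_ms / delay_budget_ms)

    # Query-adaptive augmentation: stay local by default; go to the edge
    # only when local confidence is low and the edge fits the budget.
    use_edge = (local_confidence < conf_threshold
                and est_edge_ms <= delay_budget_ms)
    return QueryPlan(token_density=density, use_edge=use_edge)
```

For example, a high-confidence query (`local_confidence=0.9`) would be answered locally, while a low-confidence one (`local_confidence=0.5`) with enough budget would trigger edge augmentation. The real system's policies for confidence estimation, token pruning, and edge scheduling are of course far more involved than this sketch.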