🤖 AI Summary
This work addresses the high deployment cost and latency of large video-language models (VLMs), as well as the difficulty lightweight local models face in balancing accuracy with real-time performance. To this end, the authors propose a local-first, on-demand edge-augmented collaborative inference framework that enhances efficiency without compromising accuracy. Key innovations include shared visual representations, accelerated video tokenization, a query-adaptive edge augmentation mechanism, and latency-aware dynamic control of visual token density. Experimental results demonstrate that the proposed approach achieves accuracy comparable to large VLMs across multiple video understanding benchmarks while reducing response latency by up to 12.8×.
📝 Abstract
Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8× reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
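To make the control ideas in the abstract concrete, here is a minimal sketch of how a local-first, delay-aware controller of this kind *could* look. The paper does not publish this code; every name, threshold, and signature below is a hypothetical illustration of the general pattern (pick a vision-token density that fits the delay budget, and escalate to the edge model only when the local model seems unreliable), not QuickGrasp's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    token_density: float  # fraction of vision tokens kept (0..1)
    use_edge: bool        # escalate this query to the edge-hosted VLM?


def plan_query(local_confidence: float,
               delay_budget_ms: float,
               est_local_ms: float,
               est_edge_ms: float,
               conf_threshold: float = 0.8) -> QueryPlan:
    """Hypothetical controller combining two ideas from the abstract:
    delay-aware token-density configuration and query-adaptive
    edge augmentation (local-first, edge on demand)."""
    # Delay-aware density: spend more vision tokens when there is slack
    # between the delay budget and the estimated local inference time.
    slack_ms = max(delay_budget_ms - est_local_ms, 0.0)
    density = min(1.0, 0.25 + slack_ms / delay_budget_ms)

    # Query-adaptive augmentation: stay local by default; go to the edge
    # only when local confidence is low and the edge fits the budget.
    use_edge = (local_confidence < conf_threshold
                and est_edge_ms <= delay_budget_ms)
    return QueryPlan(token_density=density, use_edge=use_edge)
```

For example, a high-confidence query (`local_confidence=0.9`) would be answered locally, while a low-confidence one (`local_confidence=0.5`) with enough budget would trigger edge augmentation. The real system's policies for confidence estimation, token pruning, and edge scheduling are of course far more involved than this sketch.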