QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high deployment cost and latency of large video-language models (VLMs), as well as the difficulty lightweight local models face in balancing accuracy with real-time performance. To this end, the authors propose a local-first, on-demand edge-augmented collaborative inference framework that improves efficiency without sacrificing accuracy. Key innovations include shared visual representations, accelerated video tokenization, a query-adaptive edge augmentation mechanism, and latency-aware dynamic control of visual token density. Experiments show that the approach matches the accuracy of large VLMs across multiple video understanding benchmarks while reducing response latency by up to 12.8×.

📝 Abstract
Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
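The abstract describes three cooperating mechanisms: a local-first pass with on-demand edge augmentation, and a delay-aware configuration of vision token density. The sketch below illustrates how such a pipeline could fit together; all function names, thresholds, and costs are illustrative assumptions, not the paper's actual API or parameters.

```python
# Hypothetical sketch of a local-first, edge-augmented inference loop in the
# spirit of QuickGrasp. The stub models, the confidence threshold, and the
# per-token latency cost are all assumptions for illustration only.

def local_vlm(query, tokens):
    # Stand-in for the small, locally deployed VLM: fast, but it returns a
    # self-reported confidence that may be low on hard queries.
    return {"answer": "local:" + query, "confidence": 0.6}

def edge_vlm(query, tokens):
    # Stand-in for the large edge-hosted VLM. Because the vision
    # representation is shared across model variants, only the already
    # computed tokens (not raw video) would need to be sent.
    return {"answer": "edge:" + query, "confidence": 0.95}

def pick_token_density(delay_budget_ms, per_token_cost_ms=0.5,
                       floor=64, cap=1024):
    # Delay-aware, accuracy-preserving token density: keep as many vision
    # tokens as the latency budget affords, clamped to [floor, cap].
    affordable = int(delay_budget_ms / per_token_cost_ms)
    return max(floor, min(cap, affordable))

def answer(query, video_tokens, delay_budget_ms=200.0, conf_threshold=0.8):
    density = pick_token_density(delay_budget_ms)
    tokens = video_tokens[:density]       # subsample shared representation
    result = local_vlm(query, tokens)     # local-first pass
    if result["confidence"] < conf_threshold:
        result = edge_vlm(query, tokens)  # on-demand edge augmentation
    return result
```

In this sketch, a query is escalated to the edge model only when the local model's confidence falls below the threshold, which is one plausible reading of "query-adaptive edge augmentation"; the paper may use a different escalation signal.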
Problem

Research questions and friction points this paper is trying to address.

video-language models
response delay
accuracy-latency trade-off
edge inference
video querying service
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-language models
edge-augmented inference
accelerated tokenization
local-first architecture
QoS-aware system