SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cloud-edge collaborative architectures struggle to adapt to dynamic cloud latency fluctuations in real-time vision-language tasks and fail to effectively leverage high-accuracy but high-latency large vision-language models (LVLMs). To address this, we propose a novel context transfer paradigm: for the first time, delayed LVLM outputs are modeled as reusable historical context that guides lightweight edge-model inference. Our approach introduces two key modules, context replacement and visual focusing, which jointly refine textual representations and enhance visual localization consistency, achieving low-latency execution without sacrificing accuracy. Extensive experiments across four benchmark datasets and three real-time vision-language tasks demonstrate significant improvements over state-of-the-art methods, validating the framework's robustness and efficiency under dynamic latency conditions.

📝 Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between LVLMs and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLM inference. Based on this paradigm, we design SpotVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.
Problem

Research questions and friction points this paper is trying to address.

Address cloud latency fluctuations in real-time VLM applications
Leverage delayed LVLM outputs to guide real-time SVLM inference
Enhance visual grounding consistency in cloud-edge collaborative VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cloud-edge collaboration with Context Transfer
Context replacement and visual focus modules
Utilizes delayed LVLM outputs as historical context
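The Context Transfer idea above can be sketched as a simple streaming loop: a fast edge model answers every frame, and each delayed cloud LVLM result, once it arrives, is folded into a bounded history that conditions subsequent edge inferences. This is a minimal illustrative sketch, not SpotVLM's actual implementation; the function names and the string-based stand-in models are assumptions for demonstration.

```python
from collections import deque

def svlm_infer(frame, context):
    """Placeholder edge model (SVLM): conditions its answer on the most
    recent delayed LVLM output when one is available."""
    if context:
        return f"edge({frame})+ctx({context[-1]})"
    return f"edge({frame})"

def run_stream(frames, cloud_results):
    """Run the edge model over a frame stream, absorbing delayed cloud
    results as historical context.

    cloud_results: dict mapping request index -> (arrival_step, lvlm_output),
    modeling variable cloud latency.
    """
    context = deque(maxlen=4)   # bounded history of delayed LVLM outputs
    pending = dict(cloud_results)
    outputs = []
    for t, frame in enumerate(frames):
        # Fold in any cloud answers whose delayed arrival time has passed.
        for key in list(pending):
            arrival, lvlm_out = pending[key]
            if arrival <= t:
                context.append(lvlm_out)   # becomes historical context
                del pending[key]
        # Edge inference never waits on the cloud, so latency stays low.
        outputs.append(svlm_infer(frame, context))
    return outputs
```

For example, a cloud answer for frame 0 that arrives two steps late guides frame 2 onward, while frames 0 and 1 are answered by the edge model alone; SpotVLM's context replacement and visual focus modules would refine how that stale context is applied.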
👥 Authors
Chen Qian, Tsinghua University
Xinran Yu, unknown affiliation
Zewen Huang, Tsinghua University
Danyang Li, Shuimu Scholar, Tsinghua University (Embodied AI, Mobile Computing, Internet of Things, Edge Computing, SLAM Systems)
Qiang Ma, Tsinghua University
Fan Dang, Beijing Jiaotong University
Xuan Ding, Tsinghua University
Guangyong Shang, Inspur Yunzhou Industrial Internet Co., Ltd
Zheng Yang, Tsinghua University