CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

📅 2026-05-04

🤖 AI Summary
Deploying large vision-language models (VLMs) on mobile devices incurs substantial computational and memory costs, and directly applying speculative decoding to device-edge VLM co-inference is inefficient because of excessive visual-token computation and high communication overhead. To address this, the work proposes CoVSpec, a device-edge collaborative framework in which a lightweight draft VLM on the device cooperates with a larger target VLM on the edge server via speculative decoding. Its key contributions are a training-free visual token pruning method that jointly considers query relevance, token activity, and low-rank dependency; an adaptive drafting strategy that dynamically adjusts draft length and verification frequency; and a parallel branching mechanism with decoupled verification and correction that keeps the draft side busy during target-side verification and reduces correction-related transmission. Experiments on multiple benchmarks show up to 2.21× higher throughput than target-only inference and a reduction in communication overhead of more than 96% compared with baselines, without compromising task accuracy.
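To make the pruning idea concrete, here is a minimal sketch of a training-free token selector. It is not the paper's algorithm: the summary does not give the scoring formulas, so everything below is an assumption for illustration. Query relevance is approximated by cosine similarity to a query vector, token activity by the normalized vector norm, and low-rank dependency is replaced by a simpler stand-in, namely a penalty on similarity to tokens already kept. The function name `prune_visual_tokens` and the weights `w_rel`, `w_act`, and `w_red` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def cosine(a: Vector, b: Vector) -> float:
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom > 0 else 0.0


def prune_visual_tokens(
    tokens: Sequence[Vector],
    query: Vector,
    keep: int,
    w_rel: float = 1.0,  # hypothetical weight on query relevance
    w_act: float = 0.5,  # hypothetical weight on token activity
    w_red: float = 1.0,  # hypothetical weight on the redundancy penalty
) -> List[int]:
    """Return sorted indices of the visual tokens to keep (illustrative only)."""
    if not tokens or keep <= 0:
        return []
    max_norm = max(norm(t) for t in tokens) or 1.0
    # Static part of the score: relevance to the query plus normalized activity.
    base = [
        w_rel * cosine(t, query) + w_act * norm(t) / max_norm for t in tokens
    ]
    kept: List[int] = []
    candidates = set(range(len(tokens)))
    while candidates and len(kept) < keep:
        def score(i: int) -> float:
            # Stand-in for low-rank dependency: penalize tokens that are
            # nearly collinear with a token that has already been kept.
            redundancy = max(
                (abs(cosine(tokens[i], tokens[j])) for j in kept), default=0.0
            )
            return base[i] - w_red * redundancy

        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)


# Toy example: four 2-D "visual tokens" and one query vector.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.0]]
toy_query = [1.0, 1.0]
selected = prune_visual_tokens(toy_tokens, toy_query, keep=2)
```

In the toy example, tokens 0, 1, and 3 point in almost the same direction, so once one of them is kept the redundancy penalty steers the second slot toward token 2. A greedy selector like this needs no training, which matches the training-free property described above, but the actual criteria in CoVSpec may differ substantially.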
📝 Abstract
Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
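The draft-then-verify loop that underlies speculative decoding can be sketched as follows. This is a generic greedy variant, not CoVSpec itself: it omits visual tokens, adaptive draft length, parallel branching, and any network transport. The "models" are plain functions from a token prefix to the next token, and all names (`draft_tokens`, `verify`, `speculative_decode`, the toy models) are hypothetical.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def draft_tokens(draft: NextToken, prefix: List[int], k: int) -> List[int]:
    """Device side: propose k tokens autoregressively with the small model."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    return proposed


def verify(target: NextToken, prefix: List[int], proposed: List[int]) -> List[int]:
    """Edge side: accept the longest matching prefix of the draft.

    A real system scores all positions in one batched forward pass; the loop
    here only mimics the accept/reject logic.
    """
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token replaces the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all accepted: one extra target token
    return accepted


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[int], max_new: int, k: int
) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        proposed = draft_tokens(draft, seq, k)
        seq.extend(verify(target, seq, proposed))
    return seq[: len(prompt) + max_new]


def target_only(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


# Toy models: the target counts modulo 10; the draft agrees except after a 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


spec_out = speculative_decode(toy_draft, toy_target, [0], max_new=12, k=4)
ref_out = target_only(toy_target, [0], max_new=12)
```

With greedy verification, every emitted token is one the target would have produced itself, so `spec_out` matches `ref_out`; the benefit comes from the target checking several drafted tokens per round instead of generating one token per call. In a device-edge setting each `verify` call also implies a round trip, which is why draft length and verification frequency matter for communication overhead.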
Problem

Research questions and friction points this paper is trying to address.

vision-language models
device-edge co-inference
speculative decoding
communication overhead
visual token computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
vision-language models
device-edge co-inference
visual token reduction
adaptive drafting