🤖 AI Summary
Deploying large vision-language models (VLMs) on mobile devices is constrained by their computational and memory demands, and naively applying speculative decoding to device-edge co-inference is inefficient because of excessive visual-token computation and high communication overhead. This work proposes CoVSpec, a device-edge collaborative speculative decoding framework in which a lightweight draft VLM on the device works with a larger target VLM on the edge server. Its key components are a training-free visual token pruning method that jointly considers query relevance, token activity, and low-rank dependency; an adaptive drafting strategy that dynamically adjusts draft length and verification frequency; and a parallel branching mechanism with decoupled verification and correction, which keeps the draft side busy during target-side verification and reduces correction-related transmission. Experiments on multiple benchmarks show up to 2.21× higher throughput than target-only inference and more than 96% lower communication overhead than baselines, without compromising task accuracy.
📝 Abstract
Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification and correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21× higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
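For readers unfamiliar with the draft-then-verify loop the abstract builds on, here is a minimal sketch of plain greedy speculative decoding. It is not CoVSpec itself: it has no visual token pruning, no adaptive draft length, and no parallel branching. The names `speculative_decode`, `toy_draft`, `toy_target`, and `greedy_decode` are hypothetical, and the two "models" are toy next-token functions over integers that stand in for the draft and target VLMs.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> greedy next token


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Reference: target-only greedy decoding, one model call per token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    draft: NextToken,
    target: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model (device side) proposes draft_len tokens.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))

        # 2) The target model (edge side) verifies the proposal. A real system
        #    would score all positions in one batched forward pass; the
        #    per-position calls below only simulate that pass.
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correction replaces the first mismatch
                break
        else:
            # Every draft token matched, so the target contributes one more token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except right after a 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_draft, toy_target, [0], 12)
    print(out, rounds)
```

With greedy verification the output is identical to target-only decoding, while the number of target rounds (and hence device-edge round trips) falls as more draft tokens are accepted. In this framing, the draft length and verification frequency that the abstract's adaptive drafting strategy tunes correspond to `draft_len` and to how often step 2 runs.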