🤖 AI Summary
This work addresses the high communication overhead, latency, and limited scalability in edge–cloud collaborative large language model inference caused by frequent updates of the target model. To this end, we propose a communication-efficient collaborative inference framework featuring a novel decoupled shared-backbone architecture that separates a static edge-side draft model from a dynamically updated cloud-side target model, thereby eliminating redundant model retraining and downloads at the edge. Furthermore, we introduce a channel-aware adaptive speculation mechanism that dynamically adjusts the draft length according to wireless conditions and device constraints. Experimental results across diverse edge scenarios demonstrate that our approach significantly outperforms conventional speculative decoding, substantially reducing communication costs and end-to-end latency while enhancing inference efficiency and system scalability.
📝 Abstract
Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with speculative decoding (SD) can reduce end-to-end latency by executing a lightweight draft model at the edge and verifying it with a cloud-side target model, existing frameworks fundamentally rely on tight coupling between the two models. Consequently, repeated model synchronization introduces excessive communication overhead, increases end-to-end latency, and ultimately limits the scalability of SD in edge environments. To address these limitations, we propose FlexSpec, a communication-efficient collaborative inference framework tailored for evolving edge-cloud systems. The core design of FlexSpec is a shared-backbone architecture that allows a single, static edge-side draft model to remain compatible with a large family of evolving cloud-side target models. By decoupling edge deployment from cloud-side model updates, FlexSpec eliminates the need for edge-side retraining or repeated model downloads, substantially reducing communication and maintenance costs. Furthermore, to accommodate time-varying wireless conditions and heterogeneous device constraints, we develop a channel-aware adaptive speculation mechanism that dynamically adjusts the speculative draft length based on real-time channel state information and device energy budgets. Extensive experiments demonstrate that FlexSpec consistently outperforms conventional SD approaches in inference efficiency.
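To make the channel-aware adaptive speculation idea concrete, the sketch below shows one plausible draft-length policy: when the round-trip time to the cloud verifier is high, longer drafts amortize each verification round trip, while a tight edge-side energy budget caps how many draft tokens can be generated per step. This is a hypothetical illustration under assumed names and constants (`rtt_ms`, `energy_per_token_j`, the 20 ms scaling factor, the length bounds), not the actual policy used in FlexSpec.

```python
def adaptive_draft_length(bandwidth_mbps: float,
                          rtt_ms: float,
                          energy_budget_j: float,
                          energy_per_token_j: float = 0.05,
                          min_len: int = 1,
                          max_len: int = 8) -> int:
    """Pick a speculative draft length from channel state and an energy budget.

    Hypothetical heuristic: high RTT or a weak uplink favors longer drafts
    (fewer verification round trips), while the per-step energy budget bounds
    how many edge-side draft tokens are affordable.
    """
    # Longer drafts amortize the per-round-trip cost when RTT dominates
    # (illustrative scaling: one extra draft token per ~20 ms of RTT).
    rtt_driven = max(min_len, round(rtt_ms / 20))
    # A weak uplink also favors batching more draft tokens per round trip.
    if bandwidth_mbps < 1.0:
        rtt_driven += 2
    # The energy budget caps the number of edge-side draft tokens per step.
    energy_cap = int(energy_budget_j / energy_per_token_j)
    return max(min_len, min(rtt_driven, energy_cap, max_len))
```

For example, a 100 ms round trip with ample bandwidth and energy yields a draft length of 5, while a fast link with a near-exhausted battery collapses the draft to a single token.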