🤖 AI Summary
This work addresses the deployment of vision-language models (VLMs) at the edge, where compute and memory are tightly constrained and where uploading raw visual data to the cloud incurs high latency over limited bandwidth. Existing edge-cloud collaborative approaches typically transmit fixed-size representations, so they adapt poorly to dynamic network conditions and leave semantic redundancy unexploited. To overcome these limitations, the authors propose a progressive semantic communication framework in which a Meta AutoEncoder compresses visual tokens into adaptive, progressively refinable representations. This enables on-demand transmission of semantic information between edge and cloud without model fine-tuning, giving plug-and-play compatibility with off-the-shelf VLMs. Notably, this is the first training-free approach to progressive semantic communication. Evaluated on an end-to-end system (an embedded NXP i.MX95 platform and a GPU server) at 1 Mbps uplink bandwidth, the progressive scheme significantly reduces network latency relative to full-edge and full-cloud solutions while preserving high semantic consistency even under high compression.
📝 Abstract
Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.
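The trade-off between communication cost and semantic fidelity can be illustrated with a little arithmetic. The sketch below is not the authors' implementation: the `Level` type, the `pick_level` helper, and the payload sizes are hypothetical, chosen only to show how an edge device might select the richest refinement level that fits a latency budget on a 1 Mbps uplink. It models serialization delay only and ignores protocol overhead and round-trip time.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Level:
    """One refinement level (hypothetical): total bytes sent once this level is reached."""
    name: str
    cumulative_bytes: int


def transmit_seconds(payload_bytes: int, uplink_bps: float) -> float:
    """Ideal serialization delay: bits to send divided by link rate in bits per second."""
    return payload_bytes * 8 / uplink_bps


def pick_level(levels: Sequence[Level], uplink_bps: float, budget_s: float) -> Optional[Level]:
    """Return the richest level whose cumulative payload fits the latency budget.

    `levels` is assumed to be ordered from coarse to fine. Returns None when
    even the coarsest level does not fit.
    """
    best = None
    for level in levels:
        if transmit_seconds(level.cumulative_bytes, uplink_bps) <= budget_s:
            best = level
    return best


# Hypothetical payload sizes, for illustration only (not taken from the paper).
LEVELS = [
    Level("coarse", 4_000),
    Level("medium", 16_000),
    Level("fine", 64_000),
]

UPLINK_BPS = 1_000_000  # the 1 Mbps uplink used in the evaluation

if __name__ == "__main__":
    for level in LEVELS:
        delay_ms = transmit_seconds(level.cumulative_bytes, UPLINK_BPS) * 1000
        print(f"{level.name}: {delay_ms:.0f} ms")
    chosen = pick_level(LEVELS, UPLINK_BPS, budget_s=0.2)
    print("chosen:", chosen.name if chosen else None)
```

With these assumed sizes, a 200 ms budget admits the medium level but not the fine one. A progressive scheme could then send further refinements only if the cloud side requests them or the link improves.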