🤖 AI Summary
To address high transmission and computation latency in multimodal inference on edge devices—caused by uplink bandwidth constraints and large visual token volumes—this paper proposes a task-oriented visual feature compression method enabling device-edge collaborative inference. The approach combines density peaks clustering based on K nearest neighbors with a learnable, adaptively selected hyperprior entropy model, enabling task-driven semantic feature merging and adaptive entropy coding that dynamically balance compression ratio and semantic fidelity. The method comprises task-aware feature merging and a lightweight collaborative inference architecture. Evaluated on seven VQA benchmarks, it reduces transmission overhead by up to 60% and system latency by up to 50% compared with baselines, with no degradation in task accuracy.
📝 Abstract
With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline forwards the input data directly to an edge server, which handles all computation. However, this approach introduces high transmission latency due to the limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby lowering both data transmission volume and computational complexity. Subsequently, a learnable entropy model with a hyperprior is used to encode and decode the merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of their probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency while maintaining identical task performance, compared with traditional image compression methods.
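The feature-merging step described above can be sketched in a few lines. The following is a minimal, generic implementation of density peaks clustering with k-nearest-neighbor density (DPC-KNN) applied to a matrix of visual features; the density formula (`exp` of the negative mean k-NN distance), the `k` and `num_centers` values, and the per-cluster averaging are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def dpc_knn_merge(features: np.ndarray, num_centers: int, k: int = 5) -> np.ndarray:
    """Merge N visual features into num_centers features via density peaks
    clustering with a k-nearest-neighbor density estimate (DPC-KNN sketch)."""
    n = features.shape[0]
    # Pairwise Euclidean distances between all features.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # Local density rho: decreasing in the mean distance to the k nearest
    # neighbors (column 0 is the self-distance, so it is skipped).
    knn_d = np.sort(dists, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn_d.mean(axis=1))
    # delta: distance to the nearest point of strictly higher density;
    # the global density peak gets the maximum distance instead.
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        delta[i] = dists[i, higher].min() if higher.any() else dists[i].max()
    # Cluster centers: the points with the largest rho * delta scores.
    centers = np.argsort(rho * delta)[-num_centers:]
    # Assign every feature to its nearest center, then merge by averaging.
    assign = centers[np.argmin(dists[:, centers], axis=1)]
    merged = np.stack([features[assign == c].mean(axis=0) for c in centers])
    return merged

# Example: merge 32 hypothetical visual features down to 4.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 8))
merged = dpc_knn_merge(feats, num_centers=4)
print(merged.shape)  # (4, 8)
```

Each center is its own nearest center (distance zero), so no cluster is ever empty; the merged matrix always has exactly `num_centers` rows, which is what reduces the downstream token count before entropy coding.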