MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference

πŸ“… 2026-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high computational overhead and latency challenges of deploying multimodal large language models (MLLMs) at the edge. The authors propose an edge-cloud collaborative, adaptive modality-aware offloading framework that introduces a novel modality activation sparsity metric and a confidence-guided speculative execution mechanism. Integrated with a lightweight heterogeneous modality-aware module and spatiotemporal joint sparsity analysis across modalities, the framework enables fine-grained dynamic scheduling. Experimental results on VQAv2 and MMBench demonstrate that the approach reduces end-to-end latency by 30% and cuts resource consumption by 30%–65%, while achieving 1.5–2.3Γ— higher throughputβ€”all without compromising accuracy.
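The summary names a Modality Activation Sparsity (MAS) metric but does not give its formula. Below is a minimal sketch of one plausible instantiation, assuming MAS is simply the fraction of near-zero activations a modality produces in a lightweight probe pass; the probe features, the eps threshold, and the mean pooling are assumptions, not the paper's definition.

```python
# Minimal MAS sketch: per-modality fraction of near-zero activations.
# The probe features, eps threshold, and mean pooling are assumptions;
# the paper's actual MAS formulation is not given in this summary.
import numpy as np

def mas_score(activations: np.ndarray, eps: float = 1e-3) -> float:
    """activations: [tokens, hidden] features for one modality from a
    lightweight probe pass. Returns a score in [0, 1]; higher means the
    modality is sparser and contributes less to the current query."""
    return float((np.abs(activations) < eps).mean())

# Example: image tokens with mostly tiny activations vs. denser text tokens.
rng = np.random.default_rng(0)
image_feats = rng.normal(scale=0.001, size=(196, 768))
text_feats = rng.normal(scale=1.0, size=(32, 768))
print(mas_score(image_feats), mas_score(text_feats))  # image MAS >> text MAS
```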
πŸ“ Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities but impose substantial computational and latency burdens, posing critical challenges for deployment on resource-constrained edge devices. In this paper, we propose MSAO, an adaptive modality sparsity-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. First, a lightweight heterogeneous modality-aware module performs fine-grained spatial-temporal-modal joint sparsity analysis to compute the Modality Activation Sparsity (MAS) metric, which quantifies the necessity of each modality with minimal computational overhead. Second, an adaptive speculative edge-cloud collaborative offloading mechanism dynamically schedules workloads between edge and cloud based on the derived MAS scores and real-time system states, leveraging confidence-guided speculative execution to hide communication latency. Extensive experiments on the VQAv2 and MMBench benchmarks demonstrate that MSAO achieves a 30% reduction in end-to-end latency and a 30%–65% decrease in resource overhead, while delivering a 1.5× to 2.3× throughput improvement over traditional approaches, all without compromising accuracy.
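The abstract's second component schedules each request between edge and cloud from the MAS scores and the observed system state, using a confidence-guided speculative path. The following is a rough sketch of that decision logic only, under assumed thresholds, assumed SystemState fields, and a hypothetical run_edge_draft() helper, none of which come from the paper.

```python
# Illustrative sketch of a confidence-guided offloading decision; thresholds,
# SystemState fields, and run_edge_draft() are assumptions for this example.
from dataclasses import dataclass

@dataclass
class SystemState:
    bandwidth_mbps: float   # measured uplink bandwidth to the cloud
    cloud_queue_ms: float   # estimated cloud-side queueing delay

def run_edge_draft() -> float:
    """Placeholder for a lightweight edge model that returns the
    confidence of its draft answer (fixed value in this sketch)."""
    return 0.85

def schedule(mas: dict[str, float], state: SystemState,
             mas_threshold: float = 0.6, conf_threshold: float = 0.8) -> str:
    """Pick an execution path for one request from per-modality MAS scores
    and the current system state. In the full system the cloud request would
    be issued in parallel with the edge draft to hide communication latency;
    this sketch only models the final decision."""
    dense = [m for m, s in mas.items() if s < mas_threshold]
    if not dense or state.bandwidth_mbps < 1.0:
        return "edge"            # all modalities sparse, or the link is too slow
    if run_edge_draft() >= conf_threshold:
        return "edge-draft"      # edge draft is confident: discard the cloud result
    return "cloud"               # low confidence: wait for the cloud response

print(schedule({"image": 0.2, "text": 0.7}, SystemState(50.0, 12.0)))
```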
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Edge Computing
Latency
Resource Constraints
Model Offloading
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality sparsity
edge-cloud collaboration
speculative offloading
multimodal LLM inference
adaptive scheduling
πŸ”Ž Similar Papers
No similar papers found.
Zheming Yang
Institute of Computing Technology, Chinese Academy of Sciences
Qi Guo
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jun Wan
University of Science and Technology of China
Jiarui Ruan
University of Illinois at Urbana-Champaign
Yunqing Hu
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Chang Zhao
University of Florida
Ecosystem Services, Landscape Ecology, GeoAI, Spatial Data Science, Remote Sensing
Xiangyang Li
Institute of Computing Technology, Chinese Academy of Sciences