AI Summary
This work addresses the high computational overhead and latency of deploying multimodal large language models (MLLMs) at the edge. The authors propose an edge-cloud collaborative, adaptive modality-aware offloading framework that introduces a novel modality activation sparsity metric and a confidence-guided speculative execution mechanism. Combined with a lightweight heterogeneous modality-aware module and joint spatiotemporal sparsity analysis across modalities, the framework enables fine-grained dynamic scheduling. Experimental results on VQAv2 and MMBench show that the approach reduces end-to-end latency by 30% and cuts resource consumption by 30%–65%, while achieving 1.5–2.3× higher throughput, all without compromising accuracy.
Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal reasoning but impose substantial computational and latency burdens, posing critical challenges for deployment on resource-constrained edge devices. In this paper, we propose MSAO, an adaptive modality sparsity-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. First, a lightweight heterogeneous modality-aware module performs fine-grained spatial-temporal-modal joint sparsity analysis to compute the Modality Activation Sparsity (MAS) metric, which quantifies the necessity of each modality with minimal computational overhead. Second, an adaptive speculative edge-cloud collaborative offloading mechanism dynamically schedules workloads between edge and cloud based on the derived MAS scores and real-time system states, leveraging confidence-guided speculative execution to hide communication latency. Extensive experiments on the VQAv2 and MMBench benchmarks demonstrate that MSAO achieves a 30% reduction in end-to-end latency and a 30%–65% decrease in resource overhead, while delivering a 1.5× to 2.3× throughput improvement over traditional approaches, all while maintaining competitive accuracy.
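To make the two-stage pipeline described above concrete, the following is a minimal, purely illustrative sketch of how a MAS score and a confidence-guided offloading decision might be wired together. The abstract does not give the actual MAS formula or thresholds, so every function name, the near-zero-activation definition of sparsity, and the threshold values here are hypothetical assumptions, not the authors' method.

```python
def modality_activation_sparsity(activations, eps=1e-6):
    """Hypothetical MAS proxy: fraction of near-zero activations
    for one modality (higher score = modality contributes less)."""
    near_zero = sum(1 for a in activations if abs(a) < eps)
    return near_zero / len(activations)

def schedule(mas_scores, edge_confidence,
             mas_threshold=0.6, conf_threshold=0.8):
    """Toy scheduler (assumed thresholds): keep only sufficiently
    active modalities, run on the edge when the edge model's
    confidence is high, otherwise offload to the cloud while the
    edge result serves as the speculative draft."""
    active = [m for m, s in mas_scores.items() if s < mas_threshold]
    device = "edge" if edge_confidence >= conf_threshold else "cloud"
    return {"device": device, "modalities": active}
```

For example, a request whose image tokens are 90% sparse would drop the image modality, and a low-confidence edge prediction would trigger cloud offloading; in the real system, the cloud request would overlap with speculative edge execution to hide communication latency.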