AI Summary
This work addresses the high computational overhead and latency of deploying multimodal large language models (MLLMs) at the edge. The authors propose an edge-cloud collaborative, adaptive modality-aware offloading framework that introduces a novel modality activation sparsity metric and a confidence-guided speculative execution mechanism. Combined with a lightweight heterogeneous modality-aware module and joint spatiotemporal sparsity analysis across modalities, the framework enables fine-grained dynamic scheduling. Experimental results on VQAv2 and MMBench show that the approach reduces end-to-end latency by 30% and cuts resource consumption by 30%–65%, while achieving 1.5–2.3× higher throughput, all without compromising accuracy.
Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal reasoning but impose substantial computational and latency burdens, posing critical challenges for deployment on resource-constrained edge devices. In this paper, we propose MSAO, an adaptive modality sparsity-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. First, a lightweight heterogeneous modality-aware module performs fine-grained spatial-temporal-modal joint sparsity analysis to compute the Modality Activation Sparsity (MAS) metric, which quantifies the necessity of each modality with minimal computational overhead. Second, an adaptive speculative edge-cloud collaborative offloading mechanism dynamically schedules workloads between edge and cloud based on the derived MAS scores and real-time system states, leveraging confidence-guided speculative execution to hide communication latency. Extensive experiments on the VQAv2 and MMBench benchmarks demonstrate that MSAO achieves a 30% reduction in end-to-end latency and a 30%–65% decrease in resource overhead, while delivering a 1.5× to 2.3× throughput improvement over traditional approaches, all while maintaining competitive accuracy.
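To make the two-stage pipeline described above concrete, the following is a minimal, purely illustrative sketch of how a MAS score and a confidence-guided offloading decision might be wired together. The abstract does not give the actual MAS formula or thresholds, so every function name, the near-zero-activation definition of sparsity, and the threshold values here are hypothetical assumptions, not the authors' method.

```python
def modality_activation_sparsity(activations, eps=1e-6):
    """Hypothetical MAS proxy: fraction of near-zero activations
    for one modality (higher score = modality contributes less)."""
    near_zero = sum(1 for a in activations if abs(a) < eps)
    return near_zero / len(activations)

def schedule(mas_scores, edge_confidence,
             mas_threshold=0.6, conf_threshold=0.8):
    """Toy scheduler (assumed thresholds): keep only sufficiently
    active modalities, run on the edge when the edge model's
    confidence is high, otherwise offload to the cloud while the
    edge result serves as the speculative draft."""
    active = [m for m, s in mas_scores.items() if s < mas_threshold]
    device = "edge" if edge_confidence >= conf_threshold else "cloud"
    return {"device": device, "modalities": active}
```

For example, a request whose image tokens are 90% sparse would drop the image modality, and a low-confidence edge prediction would trigger cloud offloading; in the real system, the cloud request would overlap with speculative edge execution to hide communication latency.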