MoA-Off: Adaptive Heterogeneous Modality-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference latency and excessive computational overhead of multimodal large language models (MLLMs) in resource-constrained edge environments, this paper proposes an adaptive, heterogeneous, modality-aware offloading framework with edge-cloud collaboration. The method introduces a lightweight modality-aware module that jointly models visual-language input complexity and real-time system state, driving a dynamic, multidimensional feature-based offloading decision mechanism. It supports fine-grained, task-level scheduling that adaptively distributes computational load between edge and cloud. Experiments demonstrate that the approach reduces end-to-end inference latency by over 30% compared to baseline methods, cuts GPU memory and computation usage by 30%–65%, and incurs only marginal accuracy degradation (<1.2%). This work establishes a practical paradigm for efficient, deployable multimodal AI inference in edge-cloud settings.
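The decision mechanism described above (complexity-aware routing conditioned on real-time system state) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual method: all names, feature weights, and thresholds here (`complexity_score`, `offload_decision`, the 0.5/0.5 weighting, the 0.6 threshold) are assumptions for exposition.

```python
# Hypothetical sketch of a modality-aware offloading decision.
# Weights, thresholds, and feature choices are illustrative assumptions,
# not values from the MoA-Off paper.
from dataclasses import dataclass


@dataclass
class SystemState:
    edge_gpu_util: float   # current edge GPU utilization, 0.0-1.0
    uplink_mbps: float     # available edge-to-cloud bandwidth


def complexity_score(image_entropy: float, token_count: int,
                     max_tokens: int = 512) -> float:
    """Combine visual and language complexity into a single [0, 1] score."""
    visual = min(image_entropy / 8.0, 1.0)       # 8 bits/pixel ~ max entropy
    textual = min(token_count / max_tokens, 1.0)
    return 0.5 * visual + 0.5 * textual          # equal weights (assumption)


def offload_decision(score: float, state: SystemState,
                     threshold: float = 0.6) -> str:
    """Route complex inputs to the cloud unless the uplink is too slow."""
    # Lower the effective threshold when the edge GPU is already busy,
    # so more work is pushed to the cloud under load.
    effective = threshold * (1.0 - 0.5 * state.edge_gpu_util)
    if score > effective and state.uplink_mbps > 10.0:
        return "cloud"
    return "edge"
```

For example, a high-entropy image with a long prompt on a half-loaded edge GPU would route to the cloud, while a simple input on an idle edge device stays local.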

📝 Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal inference but impose significant computational and latency burdens, posing severe challenges for deployment in resource-constrained environments. In this paper, we propose MoA-Off, an adaptive heterogeneous modality-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. MoA-Off introduces a lightweight heterogeneous modality-aware module that estimates the complexity of heterogeneous inputs through multi-dimensional feature analysis. An adaptive edge-cloud collaborative offloading strategy then dynamically schedules workloads between edge and cloud based on modality-aware complexity scores and real-time system states. Experimental results demonstrate that MoA-Off achieves over 30% reduction in latency and a 30%–65% decrease in resource overhead while maintaining competitive accuracy compared to traditional approaches.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and latency burdens in multimodal LLM deployment
Optimizing resource usage for MLLMs in constrained environments
Managing heterogeneous modality complexity through intelligent offloading
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-aware module analyzes heterogeneous input complexity
Adaptive offloading strategy schedules workloads between edge and cloud
Lightweight framework reduces latency and resource overhead
Zheming Yang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Qi Guo
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yunqing Hu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Chang Zhao
University of Florida
Chang Zhang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Jian Zhao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Wen Ji
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Institute of AI for Industries, Nanjing, China; Peng Cheng Laboratory, Shenzhen, China