DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

📅 2026-04-06
🤖 AI Summary
This work addresses the challenges of deploying multimodal large language models for video stream analysis in bandwidth-constrained edge-cloud systems, where high computational and communication overhead, elevated semantic alert latency, and inefficient visual evidence transmission hinder performance. To overcome these limitations, the authors propose a collaborative cascaded architecture that pairs a lightweight edge model with a powerful cloud-based large model, triggering deep inference only on suspicious frames. They further introduce an efficient fine-tuning strategy combining visual guidance and semantic prompting to enhance structured event understanding, alongside a dual-aware adaptive transmission mechanism that accounts for both semantic relevance and bandwidth constraints. Experimental results show that the system achieves 98.83% recognition accuracy and 100% output consistency, reduces weighted semantic alert latency by up to 77.5% under severe network congestion, and delivers 98.33% of visual evidence within 0.5 seconds.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong capability in semantic understanding and visual reasoning, yet their use on continuous video streams in bandwidth-constrained edge-cloud systems incurs prohibitive computation and communication overhead and hinders low-latency alerting and effective visual evidence delivery. To address this challenge, we propose DAT to achieve high-quality semantic generation, low-latency event alerting, and effective visual evidence supplementation. To reduce unnecessary deep reasoning costs, we propose a collaborative small-large model cascade. A lightweight edge-side small model acts as a gating module to filter non-target-event frames and perform object detection, triggering MLLM inference only for suspicious frames. Building on this, we introduce an efficient fine-tuning strategy with visual guidance and semantic prompting, which improves structured event understanding, object detection, and output consistency. To ensure low-latency semantic alerting and effective visual evidence supplementation under bandwidth constraints, we further devise a semantics- and bandwidth-aware multi-stream adaptive transmission optimization method. Experimental results show that DAT achieves 98.83% recognition accuracy and 100% output consistency. Under severe congestion, it reduces weighted semantic alert delay by up to 77.5% and delivers 98.33% of visual evidence within 0.5 s, demonstrating the effectiveness of jointly optimizing cascade inference and elastic transmission.
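The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frame` type, the `suspicion` score, the `threshold` value, and the `demo_edge_score` and `demo_cloud_infer` stand-ins are all hypothetical names introduced here for illustration. It only shows the control flow in which a cheap edge-side check decides whether the expensive MLLM call runs at all.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class Frame:
    index: int
    suspicion: float  # hypothetical edge-model score in [0, 1]


def cascade(
    frames: Iterable[Frame],
    edge_score: Callable[[Frame], float],
    cloud_infer: Callable[[Frame], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Iterator[Tuple[int, str]]:
    """Run the costly model only on frames the edge gate flags."""
    for frame in frames:
        if edge_score(frame) >= threshold:
            yield frame.index, cloud_infer(frame)


# Stand-ins for the edge small model and the cloud MLLM (both hypothetical).
def demo_edge_score(frame: Frame) -> float:
    return frame.suspicion


def demo_cloud_infer(frame: Frame) -> str:
    return f"event description for frame {frame.index}"


frames = [Frame(0, 0.1), Frame(1, 0.9), Frame(2, 0.4), Frame(3, 0.7)]
results = list(cascade(frames, demo_edge_score, demo_cloud_infer))
```

In this toy run only frames 1 and 3 reach the cloud stand-in, so two of the four frames incur no deep inference cost.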
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
edge-cloud systems
video streams
bandwidth constraints
low-latency alerting
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive transmission
cascade inference
multimodal LLM
edge-cloud systems
semantic alerting
Qi Guo
Institute of Computing Technology, Chinese Academy of Sciences
Zheming Yang
Institute of Computing Technology, Chinese Academy of Sciences
Yunqing Hu
Institute of Computing Technology, Chinese Academy of Sciences
Chang Zhao
University of Florida
Ecosystem Services, Landscape Ecology, GeoAI, Spatial Data Science, Remote Sensing
Wen Ji
Institute of Computing Technology, Chinese Academy of Sciences
multimedia communication & networking, video coding, channel coding, and optimization