Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection

📅 2025-09-24
🤖 AI Summary
Traditional object detection methods suffer performance degradation in complex scenarios—such as low-light conditions and severe occlusion—due to insufficient semantic understanding. To address this, we propose an adaptive semantic-enhanced edge-cloud collaborative detection framework leveraging multimodal large language models (MLLMs). Our approach features three key contributions: (1) instruction-tuning an MLLM to generate structured scene descriptions; (2) designing a lightweight adaptive mapping module that dynamically translates semantic descriptions into parameter-adjustment signals for edge-side detectors; and (3) establishing a confidence-driven edge-cloud inference mechanism to optimally balance semantic enhancement and resource overhead. Experimental results demonstrate that our method maintains high detection accuracy while reducing latency by over 79% and computational cost by 70% compared to baseline approaches, significantly improving real-time performance and practicality in challenging environments.

📝 Abstract
Traditional object detection methods suffer performance degradation in complex scenarios such as low-light conditions and heavy occlusion due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive-guidance semantic-enhancement edge-cloud collaborative object detection method leveraging multimodal large language models (MLLMs), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter-adjustment signals for edge detectors, achieving real-time semantic enhancement. Within the edge-cloud collaborative inference framework, the system automatically decides, based on confidence scores, whether to invoke cloud-based semantic guidance or to output edge detection results directly. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes: in low-light and heavily occluded settings it reduces latency by over 79% and computational cost by 70% while maintaining accuracy.
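The confidence-driven selection between edge output and cloud guidance can be sketched as below. This is a minimal illustration, not the paper's implementation: the `Detection` type, the single fixed threshold, and the `route_inference` helper are all assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float  # detector confidence in [0, 1]

def route_inference(edge_detections, conf_threshold=0.5):
    """Hypothetical confidence-driven routing: if every edge detection
    clears the threshold, return edge results directly; otherwise flag
    the frame for cloud-side MLLM semantic guidance."""
    if edge_detections and min(d.score for d in edge_detections) >= conf_threshold:
        return "edge", edge_detections
    # Low-confidence (or empty) results: escalate to the cloud for
    # semantic enhancement before re-running the edge detector.
    return "cloud", edge_detections
```

In this sketch the cloud path still returns the tentative edge detections so the caller can pass them along as context for the semantic guidance request.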
Problem

Research questions and friction points this paper is trying to address.

Enhancing object detection in challenging conditions like low-light and occlusions
Balancing detection accuracy with computational efficiency in edge-cloud systems
Converting semantic understanding into real-time parameter adjustments for detectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive mapping converts semantics to detector parameters
Edge-cloud framework selects guidance based on confidence scores
Multimodal LLM generates structured scene descriptions via fine-tuning
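The adaptive mapping from MLLM scene descriptions to detector parameters could be sketched as follows. The rule set, tag names (`lighting`, `occlusion`), and adjustment magnitudes are illustrative assumptions; the paper's actual module is a learned lightweight mapping, not hand-written rules.

```python
def map_semantics_to_params(scene, base_conf=0.5, base_nms=0.5):
    """Hypothetical mapping: translate structured scene tags (e.g. emitted
    by an instruction-tuned MLLM) into parameter-adjustment signals for an
    edge detector. All rules and values here are illustrative."""
    conf, nms = base_conf, base_nms
    if scene.get("lighting") == "low":
        conf -= 0.15  # accept weaker activations in low light
    if scene.get("occlusion") == "heavy":
        nms += 0.15   # keep more overlapping boxes under heavy occlusion
    return {"conf_threshold": round(conf, 2), "nms_iou": round(nms, 2)}
```

A learned version would replace the if-rules with a small network trained end-to-end, but the interface (scene description in, parameter signals out) stays the same.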
Yunqing Hu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zheming Yang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Chang Zhao
University of Florida
Ecosystem Services · Landscape Ecology · GeoAI · Spatial Data Science · Remote Sensing
Wen Ji
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Institute of AI for Industries, Nanjing, China