Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection

📅 2025-09-24
🤖 AI Summary
Traditional object detection methods suffer performance degradation in complex scenarios—such as low-light conditions and severe occlusion—due to insufficient semantic understanding. To address this, we propose an adaptive semantic-enhanced edge-cloud collaborative detection framework leveraging multimodal large language models (MLLMs). Our approach features three key contributions: (1) instruction-tuning an MLLM to generate structured scene descriptions; (2) designing a lightweight adaptive mapping module that dynamically translates semantic descriptions into parameter-adjustment signals for edge-side detectors; and (3) establishing a confidence-driven edge-cloud inference mechanism to optimally balance semantic enhancement and resource overhead. Experimental results demonstrate that our method maintains high detection accuracy while reducing latency by over 79% and computational cost by 70% compared to baseline approaches, significantly improving real-time performance and practicality in challenging environments.

📝 Abstract
Traditional object detection methods suffer performance degradation in complex scenarios such as low-light conditions and heavy occlusion due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive-guidance semantic-enhancement edge-cloud collaborative object detection method leveraging multimodal large language models (MLLMs), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter-adjustment signals for edge detectors, achieving real-time semantic enhancement. Within the edge-cloud collaborative inference framework, the system automatically decides, based on confidence scores, whether to invoke cloud-based semantic guidance or to output edge detection results directly. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes: in low-light and heavily occluded settings it reduces latency by over 79% and computational cost by 70% while maintaining accuracy.
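The confidence-driven selection between edge output and cloud guidance can be sketched as below. This is a minimal illustration, not the paper's implementation: the `Detection` type, the single fixed threshold, and the `route_inference` helper are all assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float  # detector confidence in [0, 1]

def route_inference(edge_detections, conf_threshold=0.5):
    """Hypothetical confidence-driven routing: if every edge detection
    clears the threshold, return edge results directly; otherwise flag
    the frame for cloud-side MLLM semantic guidance."""
    if edge_detections and min(d.score for d in edge_detections) >= conf_threshold:
        return "edge", edge_detections
    # Low-confidence (or empty) results: escalate to the cloud for
    # semantic enhancement before re-running the edge detector.
    return "cloud", edge_detections
```

In this sketch the cloud path still returns the tentative edge detections so the caller can pass them along as context for the semantic guidance request.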
Problem

Research questions and friction points this paper is trying to address.

Enhancing object detection in challenging conditions like low-light and occlusions
Balancing detection accuracy with computational efficiency in edge-cloud systems
Converting semantic understanding into real-time parameter adjustments for detectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive mapping converts semantics to detector parameters
Edge-cloud framework selects guidance based on confidence scores
Multimodal LLM generates structured scene descriptions via fine-tuning
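The adaptive mapping from MLLM scene descriptions to detector parameters could be sketched as follows. The rule set, tag names (`lighting`, `occlusion`), and adjustment magnitudes are illustrative assumptions; the paper's actual module is a learned lightweight mapping, not hand-written rules.

```python
def map_semantics_to_params(scene, base_conf=0.5, base_nms=0.5):
    """Hypothetical mapping: translate structured scene tags (e.g. emitted
    by an instruction-tuned MLLM) into parameter-adjustment signals for an
    edge detector. All rules and values here are illustrative."""
    conf, nms = base_conf, base_nms
    if scene.get("lighting") == "low":
        conf -= 0.15  # accept weaker activations in low light
    if scene.get("occlusion") == "heavy":
        nms += 0.15   # keep more overlapping boxes under heavy occlusion
    return {"conf_threshold": round(conf, 2), "nms_iou": round(nms, 2)}
```

A learned version would replace the if-rules with a small network trained end-to-end, but the interface (scene description in, parameter signals out) stays the same.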
Yunqing Hu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zheming Yang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Chang Zhao
University of Florida
Ecosystem Services · Landscape Ecology · GeoAI · Spatial Data Science · Remote Sensing
Wen Ji
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Institute of AI for Industries, Nanjing, China