VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sensor fusion methods struggle to adaptively respond to environmental variations—such as low illumination, rain, fog, or motion blur—leading to misaligned modality weights and unstable detection performance. To address this, we propose a vision-language conditioned sensor fusion framework that, for the first time, integrates vision-language models (VLMs) into the fusion pipeline. Leveraging VLMs’ strong semantic understanding of scenes, our method dynamically generates conditioned attention weights for visual, LiDAR, and infrared modalities. Through multimodal feature alignment and end-to-end joint training, the framework achieves robust cross-scenario fusion. Evaluated on real-world autonomous driving and military target detection datasets, our approach consistently outperforms conventional fusion methods under both seen and unseen adverse conditions, delivering substantial improvements in detection accuracy. These results validate the effectiveness and generalizability of semantic-guided adaptive fusion as a novel paradigm.
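The gating mechanism the summary describes can be pictured as a small conditioning module: a scene embedding from the VLM is mapped to one weight per modality, and the per-modality feature maps are blended with those weights. The PyTorch sketch below is a minimal illustration under assumed shapes and names (`ConditionedFusion`, the MLP gate, and all dimensions are ours, not the paper's):

```python
# Minimal sketch of VLM-conditioned modality weighting, assuming three
# modality branches already projected to a shared feature space. This is an
# illustration of the idea, not the authors' released code.
import torch
import torch.nn as nn

class ConditionedFusion(nn.Module):
    """Blends per-modality feature maps with weights predicted from a VLM scene embedding."""

    def __init__(self, vlm_dim: int = 512, num_modalities: int = 3):
        super().__init__()
        # Small MLP: VLM scene embedding -> one logit per modality.
        self.gate = nn.Sequential(
            nn.Linear(vlm_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_modalities),
        )

    def forward(self, feats, vlm_embed):
        # feats: list of M tensors, each (B, C, H, W); vlm_embed: (B, vlm_dim).
        weights = torch.softmax(self.gate(vlm_embed), dim=-1)   # (B, M)
        stacked = torch.stack(feats, dim=1)                     # (B, M, C, H, W)
        w = weights.view(weights.size(0), weights.size(1), 1, 1, 1)
        return (stacked * w).sum(dim=1)                         # fused (B, C, H, W)
```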

📝 Abstract
Although fusing multiple sensor modalities can enhance object detection performance, existing fusion approaches often overlook subtle variations in environmental conditions and sensor inputs. As a result, they struggle to adaptively weight each modality under such variations. To address this challenge, we introduce Vision-Language Conditioned Fusion (VLC Fusion), a novel fusion framework that leverages a Vision-Language Model (VLM) to condition the fusion process on nuanced environmental cues. By capturing high-level environmental context such as darkness, rain, and camera blur, the VLM guides the model to dynamically adjust modality weights based on the current scene. We evaluate VLC Fusion on real-world autonomous driving and military target detection datasets that include image, LiDAR, and mid-wave infrared (MWIR) modalities. Our experiments show that VLC Fusion consistently outperforms conventional fusion baselines, achieving improved detection accuracy in both seen and unseen scenarios.
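The abstract's "nuanced environmental cues" can be approximated with zero-shot prompting of an off-the-shelf VLM. The paper does not name its VLM or prompts; the sketch below uses CLIP from Hugging Face `transformers` with illustrative prompt strings as stand-ins:

```python
# Hedged sketch: zero-shot scoring of scene conditions (darkness, rain, blur)
# with CLIP. The model choice, prompt strings, and file name are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a photo taken in darkness",
    "a photo taken in heavy rain",
    "a blurry photo from a shaking camera",
    "a clear daytime photo",
]

image = Image.open("frame.jpg")  # hypothetical camera frame
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # (1, len(prompts))
condition = logits.softmax(dim=-1)              # soft condition scores

# `condition` (or the VLM's pooled image embedding) can serve as the
# conditioning vector fed to a modality gate such as the one sketched above.
```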
Problem

Research questions and friction points this paper is trying to address.

Adaptively weighting sensor modalities under environmental variations
Improving object detection accuracy in diverse conditions
Leveraging vision-language models for dynamic sensor fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Vision-Language Model for fusion conditioning
Dynamically adjusts modality weights using environmental cues (see the sketch after this list)
Improves detection accuracy in diverse scenarios
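Tying the two sketches together on dummy tensors: the condition scores drive the gate, and the fused map would then feed a standard detection head trained end-to-end, as the summary notes. Names carry over from the sketches above and remain hypothetical:

```python
# Hypothetical wiring of the two sketches above, on dummy tensors.
import torch

B, C, H, W = 2, 256, 32, 32
feats = [torch.randn(B, C, H, W) for _ in range(3)]  # image / LiDAR / MWIR features
condition = torch.randn(B, 4)                        # e.g. the four prompt scores

fusion = ConditionedFusion(vlm_dim=4, num_modalities=3)
fused = fusion(feats, condition)                     # (B, C, H, W) fused feature map
print(fused.shape)                                   # -> torch.Size([2, 256, 32, 32])
```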
🔎 Similar Papers
2024-03-22 · IEEE Transactions on Circuits and Systems for Video Technology (Print) · Citations: 2