Language-Depth Navigated Thermal and Visible Image Fusion

📅 2025-03-11

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

To address the limitations of single-modality perception under low-light and complex conditions, this work proposes the first language-guided, depth-driven thermal-infrared and visible-image fusion method—departing from conventional detection-centric fusion paradigms. Methodologically, we design a CLIP-semantic-guided conditional diffusion model for fusion, jointly coupled with a dual-path depth estimation network and a depth-aware loss function, enabling end-to-end co-modeling of textual semantics, multi-spectral imagery, and geometric depth. We introduce a novel language-depth collaborative navigation mechanism that significantly enhances fusion fidelity, point cloud completeness, and 3D reconstruction accuracy. Extensive experiments on robotic navigation, autonomous driving, and emergency response scenarios demonstrate superior environmental understanding and cross-illumination robustness.

Technology Category

Application Category

📝 Abstract

Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.

Problem

Research questions and friction points this paper is trying to address.

Enhances 3D reconstruction and robotics via depth-guided image fusion.

Improves scene understanding for robot navigation and environmental perception.

Integrates vision-language and depth for precise multimodal image fusion.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-guided multimodal fusion enhances 3D reconstruction.

Text-guided diffusion model extracts multi-channel features.

Depth-driven loss optimizes image fusion network.

🔎 Similar Papers

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection