🤖 AI Summary
To address the limitations of single-modality perception under low-light and complex conditions, this work proposes the first language-guided, depth-driven thermal-infrared and visible-image fusion method—departing from conventional detection-centric fusion paradigms. Methodologically, we design a CLIP-semantic-guided conditional diffusion model for fusion, jointly coupled with a dual-path depth estimation network and a depth-aware loss function, enabling end-to-end co-modeling of textual semantics, multi-spectral imagery, and geometric depth. We introduce a novel language-depth collaborative navigation mechanism that significantly enhances fusion fidelity, point cloud completeness, and 3D reconstruction accuracy. Extensive experiments on robotic navigation, autonomous driving, and emergency response scenarios demonstrate superior environmental understanding and cross-illumination robustness.
📝 Abstract
Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.