🤖 AI Summary
Infrared and visible image fusion (IVIF) suffers from performance limitations due to the absence of ground-truth fused images, forcing existing deep learning methods to rely heavily on hand-crafted mathematical loss functions. To address this, we propose the first language-driven IVIF framework that leverages natural language descriptions as semantic fusion targets. Specifically, we exploit CLIP to construct a joint text-image embedding space and introduce a novel language-guided semantic alignment loss, thereby eliminating reliance on traditional mathematical priors. Our method comprises four key components: (1) CLIP-based multimodal encoding, (2) text-guided embedding space modeling, (3) semantic alignment supervision during training, and (4) a dual-modality feature fusion network. Extensive experiments on multiple benchmark datasets demonstrate consistent and significant improvements over state-of-the-art methods, achieving superior performance in both qualitative and quantitative evaluations. This work establishes a new paradigm for unsupervised, cross-modal image fusion without ground-truth supervision.
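The core idea of supervising fusion with a text target can be illustrated with a toy sketch. The paper does not disclose its exact loss formulation here; the function below is only a minimal, hypothetical illustration of the general mechanism, assuming (as CLIP does) that image and text embeddings live in a shared space where cosine similarity measures semantic agreement. The name `language_alignment_loss` and the placeholder embeddings are assumptions, not the authors' code.

```python
import numpy as np

def language_alignment_loss(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Hypothetical language-guided alignment loss: 1 - cosine similarity
    between a fused-image embedding and a target-text embedding.
    In the actual framework these embeddings would come from CLIP's
    image and text encoders; here they are plain vectors."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    # Minimizing this drives the fused image toward the text's semantics.
    return float(1.0 - image_emb @ text_emb)

# Perfectly aligned embeddings incur ~zero loss; orthogonal ones incur 1.
v = np.array([0.6, 0.8, 0.0])
print(language_alignment_loss(v, v))                       # ≈ 0.0
print(language_alignment_loss(v, np.array([0.0, 0.0, 1.0])))  # 1.0
```

In practice the fused image's embedding would be produced by a differentiable image encoder so that gradients of this loss can train the fusion network end to end.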
📝 Abstract
Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the output of current deep-learning-based methods depends heavily on mathematically defined loss functions. Since it is difficult to mathematically characterize the desired fused image without ground truth, the performance of existing fusion methods is limited. In this paper, we propose for the first time to use natural language to express the objective of IVIF, which avoids the explicit mathematical modeling of the fusion output in current losses and exploits the expressive power of language to improve fusion performance. To this end, we present a comprehensive language-expressed fusion objective and encode the relevant texts into a multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in this embedding space by establishing relationships among the embedded vectors representing the fusion objective and the input image modalities. Finally, a language-driven loss is derived to align the actual IVIF with the embedded language-driven fusion model via supervised training. Experiments show that our method obtains substantially better fusion results than existing techniques.