🤖 AI Summary
To address key challenges in infrared and visible image fusion—including imbalanced modality representation, weak distribution modeling in generative methods, and a lack of interpretability in modality selection—this paper proposes HCLFuse, a cognition-inspired generative fusion framework. HCLFuse is the first to incorporate human cognitive principles into image fusion, featuring a multi-scale masked variational bottleneck encoder and a time-varying physics-guided diffusion model to achieve high-fidelity structural detail reconstruction and enhanced cross-modal semantic consistency. The authors introduce an information-mapping quantification theory and a physics-driven, interpretable modality selection mechanism. The framework is trained in an unsupervised manner, without requiring ground-truth fused images. Extensive experiments on multiple benchmark datasets demonstrate state-of-the-art performance, with significant improvements in semantic segmentation metrics, validating its robustness and practicality in complex scenarios.
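The "variational bottleneck encoder" mentioned above can be pictured as a head that models a posterior over latent features and penalizes its divergence from a simple prior, so only informative low-level content passes through. The sketch below is a minimal, generic variational information bottleneck in PyTorch; the class name, layer choices, and the omission of the paper's multi-scale masking are all assumptions for illustration, not HCLFuse's actual architecture.

```python
import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    """Hypothetical sketch: predicts a posterior q(z|x) = N(mu, sigma^2)
    over latent features and regularizes it toward N(0, I) via a KL term,
    compressing a feature map into a concise bottleneck representation."""

    def __init__(self, in_ch: int, z_ch: int):
        super().__init__()
        self.mu = nn.Conv2d(in_ch, z_ch, kernel_size=1)       # posterior mean
        self.logvar = nn.Conv2d(in_ch, z_ch, kernel_size=1)   # posterior log-variance

    def forward(self, feat: torch.Tensor):
        mu, logvar = self.mu(feat), self.logvar(feat)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z|x) || N(0, I)), averaged over batch, channels, and positions;
        # adding this to the training loss enforces the information bottleneck.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()
        return z, kl

# Usage: compress a 32-channel feature map into an 8-channel latent.
feat = torch.randn(2, 32, 16, 16)
bottleneck = VariationalBottleneck(32, 8)
z, kl = bottleneck(feat)
```

The KL term is what makes the bottleneck "informative yet concise": raising its weight discards more modality-specific detail, lowering it keeps more.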
📝 Abstract
Existing infrared and visible image fusion methods often struggle to balance information from the two modalities. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modality selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion through the lens of human cognitive principles and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates a quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modality information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages. This enhances the model's ability to perceive the intrinsic structure of the data and reduces its dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. These results demonstrate the advantages of this cognition-inspired generative fusion method in enhancing structural consistency and detail quality.
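The "time-varying physical guidance mechanism" described in the abstract can be illustrated as a diffusion-style sampler whose physics-based correction is blended in with a timestep-dependent weight, strong early (when coarse structure forms) and fading as sampling refines details. The sketch below is a toy stand-in under loud assumptions: the denoiser and the physics term are placeholders, and the linear schedule `lam(t)` is an invented example, not the schedule used by HCLFuse.

```python
import torch

def lam(t: int, T: int, lam_max: float = 0.5) -> float:
    """Hypothetical time-varying guidance weight: strongest at the start
    of sampling (t = T) and decaying linearly to zero at t = 0."""
    return lam_max * t / T

def guided_sample(x: torch.Tensor, T: int = 10) -> torch.Tensor:
    """DDPM-flavored sketch: at each step, a denoising update is combined
    with a physics-based correction scaled by lam(t)."""
    for t in range(T, 0, -1):
        eps_hat = 0.1 * x        # stand-in for a learned denoiser eps_theta(x, t)
        physics_grad = -x        # stand-in physics prior: pull toward zero mean
        x = x - eps_hat + lam(t, T) * physics_grad
    return x

# Usage: run the toy guided sampler on random noise shaped like an image.
fused = guided_sample(torch.randn(1, 3, 8, 8))
```

The design point the abstract makes is carried by `lam(t)`: because the guidance weight varies over the trajectory, the physics prior shapes global structure without suppressing the fine detail generated in the final steps.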