🤖 AI Summary
Existing infrared and visible image fusion methods struggle to balance downstream task adaptability with semantic controllability. To address this, the authors propose a mask prompt–guided controllable fusion framework that, for the first time, introduces an interactive mask prompting mechanism. A reference prompt encoder dynamically extracts task-specific semantics, which are then explicitly injected into the fusion process. The framework jointly optimizes fusion and segmentation objectives, enabling effective synergy between multimodal features and semantic prompts. Experiments show state-of-the-art performance in both fusion controllability and segmentation accuracy, with the fine-tuned segmentation branch even surpassing the original pre-trained model.
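To make the joint optimization concrete, here is a minimal PyTorch-style sketch of how a fusion term and a segmentation term might be combined into one objective. The max-intensity fusion target, the loss terms, and the weighting `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(fused, ir, vis, seg_logits, seg_labels, lam=0.5):
    """Hypothetical joint objective: pixel-level fusion fidelity
    plus a segmentation term on the task branch.

    fused:      (B, 1, H, W) output of the fusion branch
    ir, vis:    (B, 1, H, W) source infrared / visible images
    seg_logits: (B, C, H, W) logits from the segmentation branch
    seg_labels: (B, H, W)    ground-truth class indices
    lam:        assumed trade-off weight between the two objectives
    """
    # Fusion term: keep the strongest response from either modality
    # (a common intensity-fidelity heuristic, assumed here).
    target = torch.maximum(ir, vis)
    fusion_term = F.l1_loss(fused, target)

    # Task term: standard cross-entropy on the segmentation branch.
    seg_term = F.cross_entropy(seg_logits, seg_labels)

    return fusion_term + lam * seg_term
```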
📝 Abstract
Infrared and visible image fusion combines complementary modalities to produce images suited to all-weather perception, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability, or implicitly learn rigid semantics through cascaded detection/segmentation models, and thus cannot interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive, dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning a pre-trained segmentation model under input-mask guidance, while the PSFM explicitly injects these semantics into the fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
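As a rough illustration of the data flow described above, the sketch below wires a shared feature extractor, a mask-conditioned prompt encoder, and a semantic-injection module into one forward pass with parallel fusion and segmentation heads. All class names, layer choices, dimensions, and the use of cross-attention for injection are assumptions made for exposition; the paper's actual RPE and PSFM may differ.

```python
import torch
import torch.nn as nn

class PromptSemanticFusion(nn.Module):
    """Toy stand-in for a PSFM: injects prompt tokens into fusion
    features via cross-attention (an assumed mechanism)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, prompt):
        # feats:  (B, N, D) flattened spatial fusion features
        # prompt: (B, P, D) semantic tokens from the prompt encoder
        injected, _ = self.attn(query=feats, key=prompt, value=prompt)
        return feats + injected  # residual semantic injection

class CtrlFuseSketch(nn.Module):
    """Minimal pipeline sketch: extract -> prompt-encode -> inject
    -> decode a fused image and a segmentation map in parallel."""
    def __init__(self, dim=64, classes=9):
        super().__init__()
        self.extract = nn.Conv2d(2, dim, 3, padding=1)  # IR+VIS stacked
        self.rpe = nn.Conv2d(1, dim, 3, padding=1)      # mask prompt encoder
        self.psfm = PromptSemanticFusion(dim)
        self.fuse_head = nn.Conv2d(dim, 1, 1)           # fusion branch
        self.seg_head = nn.Conv2d(dim, classes, 1)      # task branch

    def forward(self, ir, vis, mask):
        b, _, h, w = ir.shape
        feats = self.extract(torch.cat([ir, vis], dim=1))
        prompt = self.rpe(mask)
        # Flatten spatial maps into token sequences for attention.
        f = feats.flatten(2).transpose(1, 2)            # (B, HW, D)
        p = prompt.flatten(2).transpose(1, 2)           # (B, HW, D)
        f = self.psfm(f, p).transpose(1, 2).reshape(b, -1, h, w)
        return self.fuse_head(f), self.seg_head(f)
```

A residual cross-attention injection is used here only because it is a common way to condition dense features on external tokens; it keeps the fusion path intact when the prompt contributes little, which matches the controllability goal described in the abstract.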