Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision

📅 2025-09-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Infrared and visible image fusion (IVIF) faces challenges including insufficient cross-modal interaction modeling, weak enhancement of task-critical regions, and difficulty preserving semantic consistency. To address these, we propose an end-to-end deep fusion framework featuring: (i) a modality-aware attention mechanism to explicitly model cross-modal complementarity; (ii) a pixel-wise adaptive α-fusion module for content-aware weight learning; and (iii) a weakly supervised, ROI-guided target-aware loss to improve semantic fidelity and interpretability—particularly for pedestrians and vehicles. Evaluated on the M3FD dataset, our method significantly enhances both visual quality and semantic consistency of fused images. Moreover, it delivers consistent performance gains in downstream tasks—including object detection and scene understanding—demonstrating its effectiveness in enabling high-fidelity multimodal perception.

📝 Abstract
Infrared and visible image fusion (IVIF) is a fundamental task in multi-modal perception that aims to integrate complementary structural and textural cues from different spectral domains. In this paper, we propose FusionNet, a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions. FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features based on their discriminative capacity. To achieve fine-grained, interpretable fusion, we further incorporate a pixel-wise alpha blending module, which learns spatially-varying fusion weights in an adaptive and content-aware manner. Moreover, we formulate a target-aware loss that leverages weak ROI supervision to preserve semantic consistency in regions containing important objects (e.g., pedestrians, vehicles). Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability. Our framework provides a general and extensible solution for semantic-aware multi-modal image fusion, with benefits for downstream tasks such as object detection and scene understanding.
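The pixel-wise alpha blending described in the abstract reduces to a learned convex combination per spatial location. Below is a minimal PyTorch sketch of such a module; the paper does not publish its architecture, so the class name `AlphaFusion`, the channel width, and the two-layer conv head are illustrative assumptions.

```python
# Minimal sketch of a pixel-wise alpha-blending fusion head, assuming the
# module receives already-extracted infrared and visible feature maps.
# All names and sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class AlphaFusion(nn.Module):
    """Predicts a spatial weight map alpha in (0, 1) and blends two inputs."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Small conv head: concatenated features -> single-channel alpha map.
        self.alpha_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # constrain alpha to (0, 1)
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        alpha = self.alpha_head(torch.cat([feat_ir, feat_vis], dim=1))  # B x 1 x H x W
        # Content-aware convex combination: alpha favors IR, (1 - alpha) favors VIS.
        return alpha * feat_ir + (1.0 - alpha) * feat_vis

fuse = AlphaFusion(channels=64)
ir, vis = torch.randn(2, 64, 128, 128), torch.randn(2, 64, 128, 128)
print(fuse(ir, vis).shape)  # torch.Size([2, 64, 128, 128])
```

Because the sigmoid bounds alpha to (0, 1), the blend stays a convex combination, which keeps fused features within the range spanned by the two inputs and makes the weight map directly visualizable as an interpretability cue.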
Problem

Research questions and friction points this paper is trying to address.

Insufficient modeling of cross-modal interaction when integrating complementary structural and textural cues from infrared and visible images
Weak enhancement of task-critical regions under fixed, content-agnostic fusion weights
Difficulty preserving semantic consistency in regions containing important objects (e.g., pedestrians, vehicles)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-aware attention mechanism dynamically adjusts feature contributions (see the attention sketch after this list)
Pixel-wise alpha blending module learns adaptive, spatially varying fusion weights (sketched after the abstract above)
Target-aware loss leverages weak ROI supervision to preserve semantic consistency (see the loss sketch after this list)
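As a rough idea of how modality-aware attention could weigh the two streams by their discriminative capacity, the sketch below uses a squeeze-and-excitation-style gate conditioned jointly on both modalities. This is an assumed design, not the authors' published architecture; all names (`ModalityAwareAttention`, `reduction`) are hypothetical.

```python
# Hedged sketch: each modality's channels are re-weighted by a gate computed
# from a pooled summary of BOTH streams, so the contribution of IR vs. VIS
# can shift per sample. Design and names are assumptions for illustration.
import torch
import torch.nn as nn

class ModalityAwareAttention(nn.Module):
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Shared squeeze over both modalities, separate excitation per modality.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite_ir = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )
        self.excite_vis = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, feat_ir, feat_vis):
        b, c, _, _ = feat_ir.shape
        # Global descriptor that sees both modalities at once.
        ctx = torch.cat([self.squeeze(feat_ir), self.squeeze(feat_vis)], dim=1).view(b, -1)
        gate_ir = self.excite_ir(ctx).view(b, c, 1, 1)
        gate_vis = self.excite_vis(ctx).view(b, c, 1, 1)
        # Each modality is amplified or suppressed by its learned gate.
        return feat_ir * gate_ir, feat_vis * gate_vis

attn = ModalityAwareAttention(channels=64)
ir, vis = torch.randn(2, 64, 128, 128), torch.randn(2, 64, 128, 128)
ir_w, vis_w = attn(ir, vis)  # same shapes, channel-rescaled per modality
```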
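The target-aware loss is described only at the level of weak ROI supervision, so the following is a hedged sketch of one plausible instantiation: a spatially weighted L1 reconstruction term whose weights are raised inside labeled boxes (e.g., pedestrian or vehicle detections). The helper name, the per-source averaging, and `roi_weight=5.0` are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of a target-aware loss: an L1 reconstruction term that is
# up-weighted inside weakly labeled ROI boxes. Weighting scheme is assumed.
import torch

def target_aware_l1(fused, ir, vis, roi_boxes, roi_weight=5.0):
    """fused/ir/vis: B x C x H x W; roi_boxes: list of (x1, y1, x2, y2) per image."""
    weight = torch.ones_like(fused[:, :1])           # B x 1 x H x W base weights
    for b, boxes in enumerate(roi_boxes):
        for x1, y1, x2, y2 in boxes:
            weight[b, :, y1:y2, x1:x2] = roi_weight  # emphasize target regions
    # Keep the fused image close to both source cues; here a simple average
    # of per-source L1 terms, scaled by the spatial weight map.
    l1 = 0.5 * (fused - ir).abs() + 0.5 * (fused - vis).abs()
    return (weight * l1).mean()

fused = torch.rand(1, 1, 64, 64, requires_grad=True)
ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
loss = target_aware_l1(fused, ir, vis, roi_boxes=[[(10, 10, 30, 30)]])
loss.backward()  # gradients concentrate on the boxed region
```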
Tianyao Sun
Independent researcher, New York, NY, USA
Dawei Xiang
University of Connecticut
computer vision, artificial intelligence, biomedical informatics, deep learning
Tianqi Ding
Dept. of Electrical and Computer Engineering, Baylor University, Waco, TX, USA
Xiang Fang
Dept. of Computer Science, Baylor University, Waco, TX, USA
Yijiashun Qi
University of Michigan
Zunduo Zhao
Dept. of Computer Science, New York University, New York, NY, USA