MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing infrared–visible image fusion methods predominantly emphasize semantic injection while neglecting the macro-level synergistic mechanism between pixel-level fusion and high-level perception tasks such as semantic segmentation. To address this gap, the authors propose MAFS, a unified network that jointly optimizes fusion and segmentation. The method features a parallel dual-branch architecture that explicitly models bidirectional mutual enhancement between the two tasks at the task level. It introduces a dynamic weight allocation strategy to stabilize multi-task training, incorporates heterogeneous feature fusion, adopts a multi-stage Transformer decoder, and integrates ideas from masked autoencoding with cross-modal feature fusion. Extensive experiments on multiple benchmarks demonstrate competitive performance in both fused-image visual quality and segmentation accuracy. The source code is publicly available.

📝 Abstract
Infrared-visible image fusion methods aim to generate fused images with good visual quality while also facilitating the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic, task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure containing a fusion sub-network and a segmentation sub-network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights for the two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.
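The abstract mentions a dynamic factor based on the max-min fairness allocation principle for balancing the two task losses, but does not give the formula. One illustrative reading, with all function names our own: track each task's loss relative to its initial value and give larger weight to the task that has made the least progress, so the slower task is never starved.

```python
def maxmin_fairness_weights(losses, init_losses, eps=1e-8):
    """Illustrative multi-task weighting sketch (not the paper's exact rule).

    losses      -- current loss value per task, e.g. [fusion_loss, seg_loss]
    init_losses -- loss values recorded at the start of training
    Returns normalized weights that sum to 1, favoring the task whose
    loss has decreased the least (a max-min fairness heuristic).
    """
    # Remaining-loss ratio: close to 1 means little progress on that task.
    progress = [l / (l0 + eps) for l, l0 in zip(losses, init_losses)]
    # Normalize so the weights sum to 1; the slowest task gets the
    # largest share of the combined gradient.
    total = sum(progress)
    return [p / total for p in progress]


# Example: the fusion loss has barely moved, the segmentation loss has
# dropped sharply, so fusion receives the larger weight.
weights = maxmin_fairness_weights([0.9, 0.3], [1.0, 1.0])
```

In a training loop these weights would scale the per-task losses before backpropagation; the paper's actual dynamic factor may differ in detail.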
Problem

Research questions and friction points this paper is trying to address.

Unified network for image fusion and semantic segmentation
Reciprocal promotion between pixel-wise fusion and feature perception
Multi-task learning with adaptive weights and Transformer decoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel fusion-segmentation network structure
Multi-stage Transformer decoder aggregation
Dynamic multi-task weighting mechanism
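The multi-stage decoder described above aggregates fine-grained multi-scale fused features. Stripping away the Transformer attention blocks, the coarse-to-fine aggregation itself can be sketched with NumPy as follows; the function names and the nearest-neighbor upsampling choice are ours, used purely for illustration.

```python
import numpy as np

def upsample_nearest(x, factor):
    # Repeat each spatial element 'factor' times along height and width
    # (nearest-neighbor upsampling for a 2-D feature map).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def aggregate_multiscale(features):
    """Sum a pyramid of feature maps at the finest resolution.

    features -- list of 2-D maps, finest first (e.g. H, H/2, H/4, ...).
    Each coarser map is upsampled to the finest resolution and summed,
    mimicking the coarse-to-fine pass of a multi-stage decoder
    (attention and learned projections omitted for brevity).
    """
    target_h = features[0].shape[0]
    out = np.zeros_like(features[0], dtype=float)
    for f in features:
        factor = target_h // f.shape[0]
        out += upsample_nearest(f, factor)
    return out


# Example: a three-level pyramid of all-ones maps collapses to a
# single 4x4 map whose entries sum the three levels.
pyramid = [np.ones((4, 4)), np.ones((2, 2)), np.ones((1, 1))]
fused = aggregate_multiscale(pyramid)
```

A real implementation would replace the plain sum with learned fusion (e.g. attention or convolution) at each stage, but the data flow is the same.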