🤖 AI Summary
This work addresses general-purpose image fusion: the label-free integration of cross-modal images (e.g., infrared–visible, medical, multi-focus, multi-exposure) to improve downstream detection and segmentation performance without relying on modality-specific priors.
Method: We propose SMC-Mamba, the first self-supervised Multiplex Consensus Mamba architecture designed for general image fusion. It incorporates a modality-agnostic feature enhancement module and a novel Bi-level Self-supervised Contrastive Learning (BSCL) loss. By combining spatial-channel and frequency-domain rotational scanning with a multi-expert dynamic consensus mechanism, it preserves high-frequency details without additional computational overhead.
Results: Extensive experiments demonstrate state-of-the-art performance across four fundamental fusion tasks and their corresponding downstream vision tasks, validating the method’s strong generalization, computational efficiency, and practical applicability.
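The multi-expert dynamic consensus mechanism mentioned above can be illustrated with a minimal sketch: each expert produces a candidate feature vector, and a softmax over per-expert relevance scores yields consensus weights for the fused output. This is a hypothetical simplification for intuition only, not the paper's actual implementation; the function names (`softmax`, `consensus_fuse`) and the scalar gating scores are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gating scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def consensus_fuse(expert_outputs, expert_scores):
    """Blend per-expert feature vectors with softmax consensus weights.

    expert_outputs: list of equal-length feature vectors, one per expert.
    expert_scores:  per-expert relevance scores (e.g. learned gating logits,
                    assumed here as plain floats).
    """
    weights = softmax(expert_scores)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for w, out in zip(weights, expert_outputs):
        for i in range(dim):
            fused[i] += w * out[i]
    return fused
```

A higher score shifts the fused vector toward that expert's output, while the softmax keeps all experts contributing, which is the "consensus" behavior the summary describes.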
📝 Abstract
Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby benefiting downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion must handle a wide range of tasks and improve performance without increasing complexity. To this end, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, which reach a consensus to efficiently integrate complementary information from multiple modalities. Cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from all sources. In addition, we introduce a Bi-level Self-supervised Contrastive Learning loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance on downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms on infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as on downstream vision tasks.
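To give a concrete feel for a bi-level contrastive objective of the kind BSCL describes, here is a minimal sketch: an InfoNCE-style term applied at two levels (e.g., pixel-patch and feature), summed with a weighting factor. All names (`info_nce`, `bi_level_loss`, the weight `lam`, the temperature `tau`) are illustrative assumptions; the paper's actual loss formulation is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull anchor toward the positive, push it from negatives."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # stabilize the log-sum-exp
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

def bi_level_loss(pix_a, pix_p, pix_negs,
                  feat_a, feat_p, feat_negs, lam=0.5):
    """Hypothetical bi-level objective: pixel-level plus feature-level terms."""
    return info_nce(pix_a, pix_p, pix_negs) + lam * info_nce(feat_a, feat_p, feat_negs)
```

The loss is small when the anchor matches its positive and is dissimilar from the negatives, so minimizing it encourages the fused representation to retain the contrasted (e.g., high-frequency) content at both levels.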