🤖 AI Summary
Existing infrared–visible image fusion methods rely heavily on synthetically generated multi-modal, multi-quality paired data, limiting their generalization to real-world degradation scenarios. To address this, we propose a degradation-aware unified representation learning framework. Our key contributions are: (1) a novel data-level disentanglement and re-coupling mechanism in a shared latent feature space that explicitly models cross-modal degradation discrepancies; (2) a unified loss function that supports training on real degraded data; and (3) Text-Guided Attention (TGA) to enhance semantic alignment between modalities and preserve fine-grained details. By combining an inner residual structure with degradation-aware joint optimization, our method achieves state-of-the-art performance across generic fusion, degradation-aware fusion, and downstream detection/segmentation tasks. Notably, it is the first to jointly realize realistic degradation modeling and high-fidelity fusion within a single unified framework.
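The summary does not give the exact form of the re-coupling objective; the sketch below is only an illustration of the stated idea, assuming a simple mean-squared alignment between the latents of a clean image and its real degraded counterpart (the encoder, weighting, and any contrastive terms in the actual unified loss are unknown):

```python
import numpy as np

def unified_latent_loss(z_clean: np.ndarray, z_degraded: np.ndarray,
                        alpha: float = 1.0) -> float:
    """Illustrative re-coupling loss in a shared latent space (assumed form).

    The idea: a degraded image's latent code should match the latent of its
    clean counterpart, so real restoration pairs can supervise training
    without synthetic multi-modal, multi-quality fusion triplets.
    """
    return float(alpha * np.mean((z_clean - z_degraded) ** 2))

# Toy usage: identical latents incur zero penalty; mismatched latents are pulled together.
z = np.ones((4, 8))
print(unified_latent_loss(z, z))      # 0.0
print(unified_latent_loss(z, z + 1))  # 1.0
```

In practice such an alignment term would be one component of the paper's unified loss, optimized jointly with fusion and restoration objectives.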
📝 Abstract
All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, address complex scenes by mitigating degradations in the source images and generating high-quality fused images. Mainstream ADFMs rely on synthetically generated multi-modal, multi-quality images for supervision: the inherent relationship among these images of the same scene provides explicit supervision for training, but this reliance limits effectiveness in cross-modal and rare degradation scenarios. To address these limitations, we present LURE, a degradation-aware Learning-driven Unified Representation model for infrared and visible image fusion. LURE decouples multi-modal, multi-quality data at the data level and re-couples their relationship in a unified latent feature space (ULFS) through a novel unified loss. This decoupling circumvents the data-level limitations of prior models and allows real-world restoration datasets to be leveraged for training high-quality degradation-aware models. To strengthen text-image interaction, we introduce Text-Guided Attention (TGA) together with an inner residual structure, which enhance the text's spatial perception of the image and preserve more visual details. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code will be made publicly available.
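The abstract does not detail TGA's architecture; a minimal sketch of one plausible reading is shown below, assuming image features act as queries that attend over text token embeddings via scaled dot-product cross-attention, with an inner residual connection preserving the original visual features (all shapes and the residual placement are assumptions, not the paper's specification):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_attention(img_feats: np.ndarray, txt_embeds: np.ndarray) -> np.ndarray:
    """Cross-attention sketch: each spatial location attends over text tokens.

    img_feats:  (HW, d) flattened spatial feature map (queries)
    txt_embeds: (T, d)  text token embeddings (keys and values)
    Returns a (HW, d) text-conditioned map, added back residually so the
    original visual details are preserved.
    """
    d = img_feats.shape[-1]
    scores = img_feats @ txt_embeds.T / np.sqrt(d)  # (HW, T) similarity
    attn = softmax(scores, axis=-1)                 # per-pixel weights over text tokens
    guided = attn @ txt_embeds                      # (HW, d) text-guided features
    return img_feats + guided                       # inner residual connection

# Toy usage: a 4x4 feature map (16 locations, 8 channels) guided by 5 text tokens.
rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))
txt = rng.standard_normal((5, 8))
out = text_guided_attention(img, txt)
print(out.shape)  # (16, 8)
```

The residual addition is what keeps fine-grained visual detail intact even when the text guidance is weak: if the attention output were near zero, the block would degrade gracefully to an identity mapping.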