🤖 AI Summary
Current multimodal image fusion methods are often decoupled from downstream vision tasks, yielding fused images that are suboptimal for high-level semantic understanding such as semantic segmentation. To address this, we propose UAAFusion, an attribution-guided unfolding fusion network that, for the first time, integrates attribution analysis (gradient-based interpretability) directly into the fusion process. It introduces an attribution fusion loss and an attribution pathway function that align the fusion objective with task-relevant semantics. We further design a stage-wise attribution attention mechanism and a memory augmentation module, enabling fusion and segmentation to guide each other while preserving feature fidelity across stages. Evaluated on multiple cross-modal benchmarks, our jointly optimized framework achieves significant improvements over state-of-the-art methods in both fusion quality and segmentation accuracy, empirically validating the effectiveness and generalizability of the attribution-guided paradigm.
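To make the attribution-guided idea concrete, the sketch below computes gradient-based attribution maps from a segmentation network and uses them as per-pixel weights in a fusion loss. This is a minimal sketch, assuming PyTorch; the function names, the |gradient| saliency, and the softmax weighting are our illustrative assumptions, not the paper's exact formulation.

```python
import torch

def attribution_map(seg_net, image, target_class):
    """Gradient-based attribution: |d score / d image|, summed over channels."""
    image = image.clone().requires_grad_(True)
    logits = seg_net(image)                     # (B, C, H, W) segmentation logits
    score = logits[:, target_class].sum()       # scalar task score for the chosen class
    grad, = torch.autograd.grad(score, image)
    return grad.abs().sum(dim=1, keepdim=True)  # (B, 1, H, W) saliency map

def attribution_fusion_loss(fused, ir, vis, seg_net, target_class=0):
    """Per-pixel fidelity to each source, weighted by its task attribution."""
    a_ir = attribution_map(seg_net, ir, target_class)
    a_vis = attribution_map(seg_net, vis, target_class)
    # Turn the two attribution maps into soft per-pixel source weights.
    w = torch.softmax(torch.cat([a_ir, a_vis], dim=1), dim=1)
    loss_ir = (w[:, :1] * (fused - ir) ** 2).mean()
    loss_vis = (w[:, 1:] * (fused - vis) ** 2).mean()
    return loss_ir + loss_vis

if __name__ == "__main__":
    # Toy single-channel inputs and a dummy segmentation head, for a shape check.
    seg_net = torch.nn.Conv2d(1, 9, kernel_size=3, padding=1)
    ir, vis = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
    fused = ((ir + vis) / 2).requires_grad_(True)
    print(attribution_fusion_loss(fused, ir, vis, seg_net).item())
```

Because the attribution maps are produced by `torch.autograd.grad` without `create_graph=True`, they act as fixed per-pixel weights: the loss gradient flows only into the fused image, which matches the intuition of segmentation guiding fusion rather than the reverse.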
📝 Abstract
Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches focus primarily on producing visually informative fused images through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, the downstream task typically provides no direct guidance or interaction, serving only as an auxiliary to a predefined fusion loss. To address this, we propose an "Unfolding Attribution Analysis Fusion network" (UAAFusion), which uses attribution analysis to tailor fused images more effectively to semantic segmentation and strengthens the interaction between fusion and segmentation. Specifically, we use attribution analysis techniques to measure how much the semantic regions in the source images contribute to task discrimination. Guided by these measurements, our fusion algorithm incorporates the source-image features most beneficial to the task, allowing segmentation to steer the fusion process. Our method constructs a model-driven unfolding network whose optimization objectives are derived from attribution analysis, with an attribution fusion loss computed from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, tailored to the fusion task in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize the regions and pixels crucial for high-level recognition. Additionally, to mitigate the information loss common in traditional unfolding networks, a memory augmentation module is incorporated to improve information flow across network stages. Extensive experiments demonstrate our method's superiority in image fusion and its applicability to semantic segmentation.
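The abstract's stage-wise design can be pictured with a minimal sketch of one unfolding stage: an attribution map gates the feature update, and a memory branch carries features across stages. All layer shapes, names, and the gated-residual update rule below are illustrative assumptions, assuming PyTorch; the paper's actual stage is derived from its attribution-based optimization objective.

```python
import torch
import torch.nn as nn

class UnfoldingStage(nn.Module):
    """One stage: attribution-gated refinement plus a cross-stage memory path."""
    def __init__(self, channels=32):
        super().__init__()
        # Refine the current fused feature, conditioned on the attribution map.
        self.update = nn.Sequential(
            nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Attribution attention: map a (B, 1, H, W) attribution map to a channel gate.
        self.attn = nn.Sequential(nn.Conv2d(1, channels, 1), nn.Sigmoid())
        # Memory augmentation: mix the stage output with memory carried across stages.
        self.memory = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat, attribution, mem):
        gate = self.attn(attribution)                  # emphasize task-relevant pixels
        x = self.update(torch.cat([feat, attribution], dim=1))
        x = gate * x + feat                            # gated residual refinement
        mem = self.memory(torch.cat([x, mem], dim=1))  # update the cross-stage memory
        return x, mem

if __name__ == "__main__":
    stage = UnfoldingStage(channels=32)
    feat = torch.rand(2, 32, 64, 64)   # current fused features
    attr = torch.rand(2, 1, 64, 64)    # e.g., a normalized attribution map
    mem = torch.zeros(2, 32, 64, 64)   # initial memory state
    for _ in range(4):                 # four unfolding stages (weights shared here)
        feat, mem = stage(feat, attr, mem)
    print(feat.shape, mem.shape)
```

The memory tensor passed between stages is one simple way to realize the abstract's goal of limiting information loss across unfolding iterations; the paper's module may use a different aggregation.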