🤖 AI Summary
Existing image fusion methods rely on high-level semantic task interactions, suffering from semantic gaps that limit generalizability and universality. This paper proposes a low-level vision–driven paradigm—specifically, pixel-level reconstruction—as a foundation for universal fusion, eliminating task-specific semantic modeling. We introduce GIFNet, a unified representation architecture trained via multi-task joint learning under pixel-level supervision. Our key contributions are: (i) the first task-agnostic fusion framework guided by low-level task interactions; (ii) zero-shot generalization of a single model to unseen modality pairs (e.g., infrared/visible-light, MRI/PET); and (iii) emergent capability for single-modality image enhancement. GIFNet achieves state-of-the-art performance across diverse cross-modal fusion benchmarks, demonstrating superior generalizability, architectural unity, and practical applicability.
📝 Abstract
Advanced image fusion methods mostly prioritise high-level vision tasks, where task interaction is hindered by semantic gaps and requires complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, enabling effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.
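To make the idea of pixel-level supervision concrete, the following is a minimal, generic sketch of a per-pixel reconstruction loss for fusion. The function name, L1 distance, and equal weighting are illustrative assumptions for exposition, not GIFNet's actual training objective.

```python
import numpy as np

def pixel_level_fusion_loss(fused, src_a, src_b, w=0.5):
    """Generic pixel-level supervision sketch (assumption, not GIFNet's loss):
    penalise the fused image's per-pixel L1 deviation from each source,
    so no high-level semantic labels are needed."""
    return w * np.abs(fused - src_a).mean() + (1 - w) * np.abs(fused - src_b).mean()

# Toy example: fusing two 4x4 "images" by simple averaging.
a = np.zeros((4, 4))
b = np.ones((4, 4))
fused = 0.5 * (a + b)
loss = pixel_level_fusion_loss(fused, a, b)  # -> 0.5
```

Because the signal is defined entirely at the pixel level, the same loss form applies unchanged to any modality pair, which is the property the paradigm exploits.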