🤖 AI Summary
Existing image fusion methods suffer from task-specific designs, neglect of realistic degradations (e.g., noise), high pixel-space computational cost, and a lack of interactive capability. This paper proposes the first unified multi-task, multi-degradation, language-guided image fusion framework, supporting diverse modalities (including infrared/visible and medical imaging) while remaining robust to realistic degradations such as noise. The core innovations are: (1) a degradation-aware prompt-generation mechanism; (2) a latent-space fusion architecture based on the Diffusion Transformer (DiT); (3) dual-paradigm training combining regression and flow matching; and (4) a multimodal feature-alignment strategy. A single model enables end-to-end, cross-task, cross-degradation, and language-instruction-driven fusion. It significantly outperforms both two-stage and state-of-the-art end-to-end methods across multiple benchmarks, achieving notable PSNR and SSIM gains, and supports real-time interaction. Code is publicly available.
📝 Abstract
Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the inputs. However, existing methods face several significant limitations: 1) they require task- or dataset-specific models; 2) they neglect real-world image degradations (*e.g.*, noise), causing failures on degraded inputs; 3) they operate in pixel space, where attention mechanisms are computationally expensive; and 4) they lack user-interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework comprises two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which produces a clean fused image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Building on this framework, we develop two model variants: regression-based and flow-matching-based. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms both previous restoration+fusion pipelines and prior all-in-one methods. Code is available at https://github.com/294coder/MMAIF.
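The degradation pipeline described above pairs each simulated degradation with a language prompt that tells the model what to undo before fusing. A minimal sketch of that idea is below; the function and table names (`degrade_and_prompt`, `DEGRADATIONS`, `PROMPTS`) and the specific degradations are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical degradation table: each entry maps a degradation name to a
# function that corrupts a clean image in [0, 1].
DEGRADATIONS = {
    "gaussian_noise": lambda img: np.clip(img + rng.normal(0, 0.1, img.shape), 0, 1),
    "low_light":      lambda img: np.clip(img * 0.4, 0, 1),
}

# Matching language prompts that describe the restoration the model should
# perform before (or while) fusing the modalities.
PROMPTS = {
    "gaussian_noise": "Remove the noise, then fuse the infrared and visible images.",
    "low_light":      "Brighten the low-light input, then fuse the infrared and visible images.",
}

def degrade_and_prompt(img: np.ndarray) -> tuple[np.ndarray, str]:
    """Apply a randomly chosen degradation and return (degraded image, prompt)."""
    name = rng.choice(list(DEGRADATIONS))
    return DEGRADATIONS[name](img), PROMPTS[name]

clean = rng.random((8, 8, 3))            # stand-in for one source image
degraded, prompt = degrade_and_prompt(clean)
```

At training time, the degraded image and the prompt would be fed to the fusion model as conditions, while the clean target supervises the output.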
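The flow-matching variant trains the network to predict the velocity of a probability path between noise and the target latent. The sketch below shows the standard rectified-flow objective with a linear interpolation path; the stand-in `oracle` model and the numpy setting are illustrative assumptions (the paper's model is a latent-space DiT, not shown here).

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x0, x1):
    """Flow-matching MSE loss on a straight path from noise x0 to target x1.

    model(xt, t) is expected to predict the path's velocity at time t.
    """
    t = rng.random((x0.shape[0], 1))        # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point on the linear path
    v_target = x1 - x0                      # constant velocity of that path
    v_pred = model(xt, t)
    return np.mean((v_pred - v_target) ** 2)

# Toy check: an "oracle" that already outputs the true velocity for one fixed
# (x0, x1) pair drives the loss to exactly zero.
x0 = rng.standard_normal((4, 16))           # noise samples
x1 = rng.standard_normal((4, 16))           # stand-in target latents
oracle = lambda xt, t: x1 - x0
loss = flow_matching_loss(oracle, x0, x1)   # → 0.0
```

The regression variant, by contrast, would supervise the network's output directly against the clean fused target instead of a velocity field.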