Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image fusion methods suffer from limited robustness, adaptability, and controllability, struggling with complex degradations such as low illumination, color casts, and exposure imbalance. They are further constrained by the scarcity of real-world paired data and small-scale benchmarks, hindering effective semantic understanding and fine-grained multimodal alignment. To address these challenges, we propose DiTFuse—the first instruction-driven diffusion Transformer fusion framework—unifying dual-image inputs and natural language instructions for end-to-end, semantics-aware fusion. Our approach innovatively integrates diffusion modeling with Transformer architectures to enable cross-task zero-shot generalization and text-guided semantic-level control. Key technical contributions include multi-degradation mask modeling, joint vision-language latent encoding, and a modality-invariant restoration mechanism. DiTFuse achieves state-of-the-art performance on IVIF, MFF, and MEF benchmarks, yielding fused images with enhanced texture clarity and superior semantic fidelity, while also supporting downstream tasks such as instruction-conditioned segmentation.

📝 Abstract
Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground-truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
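The abstract's core mechanism is joint encoding: two source images and a text instruction become one shared token sequence that the diffusion Transformer processes together. A minimal numpy sketch of that idea follows; all names, dimensions, and the single-head attention step are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy sketch (assumed names and sizes): embed an image pair and a text
# instruction into one shared token sequence, then let every token
# attend to every other, as a stand-in for the DiT backbone.

rng = np.random.default_rng(0)
d = 64  # shared latent dimension (assumed)

def patchify(img, patch=8):
    """Split an HxW image into flattened non-overlapping patch tokens."""
    h, w = img.shape
    p = img.reshape(h // patch, patch, w // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

# Toy inputs: an infrared/visible pair and a tokenized instruction.
ir = rng.standard_normal((32, 32))
vis = rng.standard_normal((32, 32))
instruction_ids = np.array([5, 17, 42])  # e.g. "enhance dark regions"

W_img = rng.standard_normal((64, d)) / 8.0   # patch -> latent projection
E_text = rng.standard_normal((100, d)) / 8.0  # toy text embedding table

tokens = np.concatenate([
    patchify(ir) @ W_img,      # infrared tokens
    patchify(vis) @ W_img,     # visible tokens
    E_text[instruction_ids],   # instruction tokens
])                             # -> one shared sequence

def self_attention(x):
    """Single-head attention: image and text tokens mix in one pass."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

fused = self_attention(tokens)
print(tokens.shape, fused.shape)  # (35, 64) (35, 64)
```

Because text tokens sit in the same sequence as image patches, the instruction can reweight fusion at every layer rather than being bolted on before or after fusion, which is the pipeline limitation the abstract criticizes.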
Problem

Research questions and friction points this paper is trying to address.

Unifies semantic and controllable image fusion across multiple modalities
Addresses lack of user intent incorporation in complex degradation scenarios
Overcomes absence of ground truth and small datasets for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-Transformer framework for end-to-end image fusion
Multi-degradation masked-image modeling without ground truth
Natural-language instructions enable hierarchical fusion control
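The second innovation bullet, masked-image modeling under multiple degradations, can be sketched as follows. The specific degradations, masking ratio, and patch size here are assumptions for illustration; only the principle (restore hidden, degraded content so no ground-truth fused image is needed) comes from the paper.

```python
import numpy as np

# Toy sketch of multi-degradation masked-image modeling (assumed
# parameters): the model receives a degraded, partially masked image
# and is supervised to restore the clean content on masked positions,
# removing the need for ground-truth fused images.

rng = np.random.default_rng(1)
clean = rng.uniform(0.0, 1.0, size=(16, 16))

def degrade(img, mode):
    """Simulated degradations named in the abstract (values assumed)."""
    if mode == "low_light":
        return img * 0.3
    if mode == "color_cast":
        return np.clip(img + 0.2, 0.0, 1.0)
    if mode == "over_exposure":
        return np.clip(img * 1.8, 0.0, 1.0)
    raise ValueError(mode)

def random_mask(shape, ratio=0.5, patch=4):
    """Hide roughly `ratio` of non-overlapping patches, MAE-style."""
    gh, gw = shape[0] // patch, shape[1] // patch
    keep = rng.uniform(size=(gh, gw)) > ratio
    return np.kron(keep, np.ones((patch, patch)))

mask = random_mask(clean.shape)
masked_input = degrade(clean, "low_light") * mask  # what the model sees

# Training target: the clean image on *masked* positions only, which is
# what forces restoration rather than copying.  A real model would
# predict `pred`; here an identity predictor stands in for the demo.
pred = masked_input
loss = np.mean(((pred - clean) * (1 - mask)) ** 2)
print(float(loss) >= 0.0)
```

Since the supervision signal is the clean source image itself, the same objective covers low-light, color-cast, and exposure degradations at once, which is how the bullets' "without ground truth" claim is realized.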
Jiayang Li
Faculty of Computing, Harbin Institute of Technology, Harbin 150001
Chengjie Jiang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Junjun Jiang
Harbin Institute of Technology
Image Processing · Computer Vision · Machine Learning
Pengwei Liang
Faculty of Computing, Harbin Institute of Technology, Harbin 150001
Jiayi Ma
Wuhan University
Computer Vision · Image Fusion · Image Matching
Liqiang Nie
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China