FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address structural distortion and detail loss arising from low-resolution inputs and weak semantic guidance in joint multimodal image fusion and super-resolution, this paper proposes FS-Diff—the first unified semantic-guided conditional diffusion framework for these tasks. Methodologically, it introduces a clarity-aware mechanism and a bidirectional feature Mamba network to enable adaptive cross-modal feature extraction; further, it designs an enhanced U-Net architecture supporting joint denoising across multiple noise levels, with both source images and semantic maps serving as dual conditioning signals for stochastic sampling. Evaluated on six public benchmarks and a newly constructed aerial multimodal scene dataset (AVMS), FS-Diff achieves state-of-the-art performance across 2×–8× super-resolution factors. It effectively restores fine-grained textures and high-level semantics while preserving structural fidelity and enhancing perceptual quality.

📝 Abstract
As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset demonstrate that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
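The sampling procedure the abstract describes — initialize the fused result as pure Gaussian noise, then iteratively denoise while conditioning on the source images and semantic map — follows the standard conditional DDPM reverse process. A minimal NumPy sketch of that loop is below; `denoise_fn` is a hypothetical stand-in for the paper's modified U-Net (which we do not reproduce here), and the conditioning dictionary keys are illustrative, not the paper's actual interface.

```python
import numpy as np

def ddpm_reverse_sample(denoise_fn, cond, shape, betas, rng):
    """Conditional DDPM reverse process (sketch).

    denoise_fn(x_t, t, cond) -> predicted noise; stands in for the
    paper's modified U-Net, which receives the source images and a
    semantic map as conditions at every step (hypothetical interface).
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)           # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = denoise_fn(x, t, cond)     # noise prediction, conditioned
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                            # add noise except at the final step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x
```

For instance, with a dummy denoiser `lambda x, t, c: np.zeros_like(x)` and a linear beta schedule, the loop runs end to end and returns a sample of the requested shape; in FS-Diff the learned network would instead steer this trajectory toward a high-resolution fused image.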
Problem

Research questions and friction points this paper is trying to address.

Jointly addressing image fusion and super-resolution challenges
Overcoming low resolution and weak semantic information in multimodal images
Enhancing fusion results with cross-modal features and semantic guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic guidance and clarity-aware fusion
Conditional generation with bidirectional Mamba
Modified U-Net denoising for super-resolution
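The abstract notes the U-Net is trained for denoising at multiple noise levels. In standard diffusion training this means sampling a random timestep per example, noising the clean target to that level, and regressing the injected noise; a hedged sketch under that assumption follows (again, `denoise_fn` and the `cond` argument are illustrative stand-ins, not the paper's code).

```python
import numpy as np

def diffusion_training_step(denoise_fn, x0, cond, betas, rng):
    """One training step at a randomly sampled noise level (sketch).

    Noises the clean target x0 to a random timestep t, then returns the
    MSE between the injected noise and the network's prediction --
    the usual epsilon-prediction diffusion objective.
    """
    alpha_bars = np.cumprod(1.0 - betas)
    t = rng.integers(len(betas))                     # random noise level
    eps = rng.standard_normal(x0.shape)              # injected noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = denoise_fn(x_t, t, cond)               # conditioned prediction
    return np.mean((eps - eps_hat) ** 2)             # noise-prediction MSE
```

Averaging this loss over timesteps is what lets a single network denoise at every noise level during the reverse sampling loop.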