🤖 AI Summary
To address weak cross-modal complementarity between high-resolution panchromatic (PAN) and low-resolution multispectral (MS) images, low computational efficiency, and insufficient fine-grained spatial-spectral correspondence in remote sensing image fusion, this paper proposes MMMamba, the first Mamba-based cross-modal in-context fusion framework. The method introduces: (1) a multimodal interleaved (MI) scanning mechanism for efficient and precise PAN–MS feature interaction; (2) an in-context conditioning strategy that enables zero-shot image super-resolution without task-specific training; and (3) linear-complexity state-space modeling that substantially reduces computational overhead. Evaluated on multiple standard benchmarks, the approach achieves state-of-the-art performance in both pan-sharpening and zero-shot super-resolution, with consistent improvements in PSNR and SSIM.
📝 Abstract
Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
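The core idea of interleaved scanning can be illustrated with a toy sketch: PAN and MS feature tokens are woven into a single alternating sequence, so a linear-time recurrence touches both modalities at every other step. The function names, the simple decay recurrence, and the strict pan/ms alternation below are our illustrative assumptions, not the paper's actual MI ordering or the real Mamba selective-scan.

```python
import numpy as np

def interleave_tokens(pan, ms):
    """Interleave PAN and MS token sequences along the sequence axis.

    pan, ms: arrays of shape (N, d). Returns (2N, d) ordered as
    [pan_0, ms_0, pan_1, ms_1, ...], so consecutive scan steps
    alternate between the two modalities. (Illustrative only; the
    paper's exact MI scan order may differ.)
    """
    n, d = pan.shape
    out = np.empty((2 * n, d), dtype=pan.dtype)
    out[0::2] = pan
    out[1::2] = ms
    return out

def toy_linear_scan(x, decay=0.9):
    """A toy stand-in for a Mamba/SSM scan: h_t = decay * h_{t-1} + x_t.

    One pass over the sequence, so the cost is O(L) in sequence
    length L, unlike the O(L^2) pairwise cost of cross-attention.
    """
    h = np.zeros(x.shape[1])
    ys = []
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        ys.append(h.copy())
    return np.stack(ys)

# Usage: fuse 4 PAN tokens with 4 MS tokens (feature dim d = 8).
pan = np.ones((4, 8))
ms = np.zeros((4, 8))
seq = interleave_tokens(pan, ms)            # shape (8, 8), alternating modalities
fused = toy_linear_scan(seq)                # each step mixes in the other modality's state
pan_out, ms_out = fused[0::2], fused[1::2]  # de-interleave back per modality
```

Because the hidden state carries information across adjacent positions, each MS token's output is conditioned on the PAN token scanned just before it (and vice versa), which is the in-context exchange the MI mechanism is designed to provide, at linear rather than quadratic cost.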