🤖 AI Summary
This work proposes FoundPS, the first satellite-agnostic and scene-robust foundation model for pan-sharpening, addressing the limited generalization of existing methods that are often tailored to specific satellites and scenes. FoundPS introduces several key innovations: a modality-interleaved Transformer that maps multi-spectral images with arbitrary spectral bands into a unified latent space, a reversible spectral affine basis to preserve spectral structure, a latent diffusion bridge combined with bridge posterior sampling to enhance fusion stability, and an infinite-dimensional pixel-latent interaction mechanism to improve fine-detail reconstruction. Evaluated on PSBench—a newly curated large-scale benchmark—FoundPS significantly outperforms state-of-the-art approaches and demonstrates exceptional generalization and robustness across diverse sensors and scenes.
📝 Abstract
Pansharpening generates a high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. Existing methods are predominantly satellite-specific and scene-dependent, which severely limits their generalization across heterogeneous sensors and varied scenes, thereby reducing their real-world practicality. To address these challenges, we present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion. Specifically, we introduce a modality-interleaved Transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS images into a unified latent space via tensor multiplication. Building upon this, we construct a latent diffusion bridge model to progressively evolve latent representations, and incorporate bridge posterior sampling to couple latent diffusion with pixel-space observations, enabling stable and controllable fusion. Furthermore, we devise infinite-dimensional pixel-to-latent interaction mechanisms to comprehensively capture the cross-domain dependencies between PAN observations and MS representations, thereby facilitating complementary information fusion. In addition, to support large-scale training and evaluation, we construct a comprehensive pansharpening benchmark, termed PSBench, consisting of worldwide MS and PAN image pairs from multiple satellites across diverse scenes. Extensive experiments demonstrate that FoundPS consistently outperforms state-of-the-art methods, exhibiting superior generalization and robustness across a wide range of pansharpening tasks.
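The core of the satellite-agnostic design is mapping an MS image with an arbitrary number of spectral bands into a fixed-dimensional latent space through a reversible affine basis applied by tensor multiplication. A minimal NumPy sketch of that idea follows; all names, shapes, and the use of a fixed random matrix (instead of the paper's learned, band-wise bases) are illustrative assumptions, not the actual FoundPS implementation.

```python
import numpy as np

def make_spectral_basis(num_bands, latent_dim, seed=0):
    # Hypothetical stand-in for a learned spectral affine basis:
    # a (num_bands x latent_dim) matrix; with latent_dim >= num_bands
    # it almost surely has full row rank, so the mapping is reversible.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_bands, latent_dim))

def to_latent(ms, basis):
    # ms: (H, W, B) MS image with an arbitrary band count B.
    # Band-wise tensor multiplication: (H, W, B) x (B, D) -> (H, W, D),
    # giving a unified D-channel latent regardless of the sensor's B.
    return np.einsum('hwb,bd->hwd', ms, basis)

def from_latent(z, basis):
    # Reversibility: the Moore-Penrose pseudo-inverse maps the latent
    # back to the original spectral bands exactly when the basis has
    # full row rank.
    return np.einsum('hwd,db->hwb', z, np.linalg.pinv(basis))
```

Under this sketch, 4-band and 8-band sensors would share the same 8-dimensional latent space, with reconstruction `from_latent(to_latent(ms, A), A)` recovering the input up to numerical precision.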