🤖 AI Summary
To address the poor cross-sensor generalization of deep learning-based pansharpening models, this paper proposes a feature-level adaptation method that requires no additional training data. The approach modularly decouples the fusion process, incorporates physics-aware unsupervised losses, tailors fused features, and enables patch-wise parallel inference—augmented by a plug-and-play feature-tailoring module that efficiently refines feature-level fusion. The method achieves end-to-end training and inference in under one second (0.2 s for a 512×512×8 multispectral image on an RTX 3090), accelerating zero-shot baselines by over two orders of magnitude while substantially reducing computational cost. Evaluated on diverse real-world remote sensing datasets, it attains state-of-the-art performance with strong robustness and efficiency. Overall, the solution provides a lightweight, plug-and-play framework for cross-sensor pansharpening.
📝 Abstract
Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining the model or applying zero-shot methods, but these are highly time-consuming or even require extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) $\textit{Improved Generalization Ability}$: it significantly enhances performance in cross-sensor cases. (2) $\textit{Low Generalization Cost}$: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, it completes training and inference for a $512\times512\times8$ image within $\textit{0.2 seconds}$ and a $4000\times4000\times8$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU, over 100 times faster than zero-shot methods.
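To make the pipeline concrete, the sketch below illustrates the two mechanics the abstract describes: a lightweight adaptor ("Feature Tailor") applied to high-dimensional fused features just before they are mapped to output image channels, and patch-wise processing so each patch can be handled independently (and thus in parallel). This is a minimal NumPy illustration, not the paper's implementation; the `FeatureTailor` class, its 1×1 linear form, and all shapes and names here are assumptions for exposition.

```python
import numpy as np

def split_patches(x, p):
    """Split an H x W x D feature map into non-overlapping p x p patches."""
    H, W, _ = x.shape
    return [x[i:i + p, j:j + p] for i in range(0, H, p) for j in range(0, W, p)]

def merge_patches(patches, H, W, p):
    """Reassemble patches (row-major order) into an H x W x D map."""
    out = np.empty((H, W, patches[0].shape[-1]), dtype=patches[0].dtype)
    k = 0
    for i in range(0, H, p):
        for j in range(0, W, p):
            out[i:i + p, j:j + p] = patches[k]
            k += 1
    return out

class FeatureTailor:
    """Hypothetical lightweight adaptor (a 1x1 linear map, initialized near
    identity) acting on D-dim fused features at the interface where the
    backbone starts mapping features to final image channels."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
        self.b = np.zeros(dim)

    def __call__(self, feat):  # feat: h x w x D
        return feat @ self.W + self.b

# Toy pipeline: pretend a frozen pretrained backbone produced these
# fused features for a cross-sensor input; only the tailor would be trained.
H, W, D, P = 64, 64, 16, 32
feat = np.random.default_rng(1).standard_normal((H, W, D))
tailor = FeatureTailor(D)
patches = [tailor(p) for p in split_patches(feat, P)]  # independent per patch
adapted = merge_patches(patches, H, W, P)
print(adapted.shape)  # (64, 64, 16)
```

In the actual method the tailor's parameters would be optimized on a subset of patches with the physics-aware unsupervised losses, then applied to all patches at inference; the near-identity initialization reflects that the adaptor only needs to correct a pretrained model's features, not replace them.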