🤖 AI Summary
To address weak cross-sensor generalization, high computational cost, and poor real-time deployability in pansharpening, this paper proposes a single-sample adaptive conditional fine-tuning framework. The method introduces a lightweight, plug-and-play Conditional Adaptive Transformation (CAT) module, combined with an instance-level unsupervised patch-wise adaptation mechanism, enabling millisecond-scale single-image adaptation of a pre-trained backbone. Without requiring labeled data from target sensors, it preserves both accuracy and efficiency via conditional feature cropping and patch-wise inference. Evaluated on real-world WorldView-2/3 datasets, the approach achieves state-of-the-art performance. On an RTX 3090 GPU, it processes 512×512 images in 0.4 seconds and 4000×4000 images in 3 seconds, demonstrating significant improvements in cross-sensor generalization and practical deployability.
📝 Abstract
Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning has significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time application. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features; its parameters are then frozen for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model, although pre-trained on a specific dataset, achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using only the single LRMS-PAN input pair, without requiring large-scale retraining. Experiments on real-world data from the WorldView-2 and WorldView-3 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor data, completing both training and inference within $\textit{0.4 seconds}$ for a $512\times512$ image and within $\textit{3 seconds}$ for a $4000\times4000$ image at the fastest setting on a commonly used RTX 3090 GPU.
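The patch-wise pipeline described above (split the LRMS-PAN pair into patches, fine-tune the CAT module on a subset without labels, then run frozen inference on all patches and stitch) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size, the subset-selection rule (here simply the first few patches), and the `model.adapt_cat` / `model.infer` interfaces are assumptions introduced for the sketch.

```python
import numpy as np

def split_into_patches(img, patch):
    """Split an H x W x C image into non-overlapping patch x patch tiles.
    Assumes H and W are divisible by `patch` for simplicity."""
    h, w, _ = img.shape
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h, patch)
            for x in range(0, w, patch)]

def stitch_patches(tiles, h, w, patch):
    """Reassemble tiles produced by split_into_patches into an H x W x C image."""
    c = tiles[0].shape[2]
    out = np.zeros((h, w, c), dtype=tiles[0].dtype)
    i = 0
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            out[y:y + patch, x:x + patch] = tiles[i]
            i += 1
    return out

def adapt_and_infer(lrms_up, pan, model, patch=128, n_adapt=4):
    """Single-sample adaptation sketch.

    lrms_up : LRMS image upsampled to PAN resolution, H x W x C.
    pan     : panchromatic image, H x W x 1.
    model   : pre-trained backbone exposing hypothetical `adapt_cat`
              (unsupervised CAT fine-tuning) and `infer` (frozen forward pass).
    """
    ms_tiles = split_into_patches(lrms_up, patch)
    pan_tiles = split_into_patches(pan, patch)
    pairs = list(zip(ms_tiles, pan_tiles))
    model.adapt_cat(pairs[:n_adapt])            # unsupervised, instance-level
    fused = [model.infer(ms, p) for ms, p in pairs]  # CAT parameters frozen
    h, w, _ = lrms_up.shape
    return stitch_patches(fused, h, w, patch)
```

Because adaptation touches only the lightweight CAT parameters and uses a handful of patches from the test image itself, the whole train-plus-infer loop stays fast, which is what enables the sub-second timings reported above.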