🤖 AI Summary
Remote sensing image (RSI) editing faces two key challenges: insufficient diversity in existing benchmark datasets, which hinders general-purpose editing, and ambiguous text–image semantic alignment, which introduces erroneous semantics. This paper proposes the first text-guided RSI editing method trained on a single image, eliminating the need for large-scale remote sensing datasets. The approach introduces three innovations: (1) a single-image multi-scale training paradigm that enhances cross-sensor generalization; (2) a remote sensing-specific prompt ensembling (PE) mechanism that jointly optimizes editing stability and text controllability; and (3) an integrated framework combining multi-scale feature alignment, a remote sensing pre-trained vision-language model, and text-conditional diffusion fine-tuning. Experiments demonstrate high-fidelity, fine-grained editing across multi-source RSIs: content consistency improves by 32.7% and editing accuracy by 26.4% over baselines, significantly outperforming methods adapted from natural images.
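The single-image multi-scale training paradigm mentioned above resembles SinGAN-style scale pyramids: the one training image is repeatedly downsampled, and the model is trained coarse-to-fine across the resulting scales. A minimal sketch of building such a pyramid, assuming a simple nearest-neighbor downsampler and a hypothetical scale factor of 0.75 (the paper's actual resampling scheme and factor are not specified here):

```python
import numpy as np

def downsample(img: np.ndarray, factor: float) -> np.ndarray:
    """Nearest-neighbor downsample to round(H*factor) x round(W*factor)."""
    h, w = img.shape[:2]
    nh, nw = max(1, round(h * factor)), max(1, round(w * factor))
    rows = np.arange(nh) * h // nh  # source row index for each output row
    cols = np.arange(nw) * w // nw  # source column index for each output column
    return img[rows][:, cols]

def build_pyramid(img: np.ndarray, num_scales: int = 5, r: float = 0.75) -> list:
    """Multi-scale pyramid from a single image; returns coarsest scale first."""
    pyramid = [downsample(img, r ** s) for s in range(num_scales)]
    return pyramid[::-1]  # coarsest first, original resolution last
```

Training then proceeds from `pyramid[0]` upward, so consistency learned at coarse scales constrains the fine-scale edits.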
📝 Abstract
Artificial intelligence-generated content (AIGC) has significantly advanced image generation in the field of remote sensing. However, the equally important task of remote sensing image (RSI) editing has received far less attention. Deep learning-based editing methods generally involve two sequential stages: generation and editing. The generation stage must maintain consistency in content and details between the original and edited images, while the editing stage must ensure the controllability and accuracy of the edits. For natural images, these challenges can be tackled by training generative backbones on large-scale benchmark datasets and using text guidance based on vision-language models (VLMs). These previously effective approaches, however, become less viable for RSIs for two reasons. First, existing generative RSI benchmark datasets do not fully capture the diversity of remote sensing scenarios, particularly variations in sensors, object types, and resolutions; consequently, the generalization capacity of the trained backbone model is often inadequate for universal RSI editing tasks. Second, the large spatial size of RSIs exacerbates the problem in VLMs whereby a single text semantic corresponds to multiple image semantics, introducing incorrect semantics when text is used to guide RSI editing. To address these problems, this paper proposes a text-guided RSI editing method that is both controllable and stable, and can be trained using only a single image. It adopts a multi-scale training approach to preserve consistency without training on extensive benchmark datasets, while leveraging RSI pre-trained VLMs and prompt ensembling (PE) to ensure accuracy and controllability in the text-guided editing process.
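Prompt ensembling (PE), as used for CLIP-style models, typically encodes a concept under several prompt templates and averages the normalized text embeddings, which damps the one-text-to-many-images ambiguity the abstract describes. A minimal sketch under assumptions: the remote-sensing templates below and the `encode_text` stand-in are illustrative, not the paper's actual templates or its RSI pre-trained VLM's text encoder.

```python
import hashlib
import numpy as np

# Hypothetical remote-sensing prompt templates (illustrative only).
TEMPLATES = [
    "a satellite image of {}.",
    "an aerial photo of {}.",
    "a remote sensing scene of {}.",
]

def encode_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Stand-in text encoder producing a deterministic pseudo-embedding.
    In practice this would be the RSI pre-trained VLM's text tower."""
    seed = int.from_bytes(hashlib.md5(prompt.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def ensemble_embedding(concept: str) -> np.ndarray:
    """PE: L2-normalize each template embedding, average, then renormalize."""
    embs = [encode_text(t.format(concept)) for t in TEMPLATES]
    embs = [e / np.linalg.norm(e) for e in embs]
    mean = np.mean(embs, axis=0)
    return mean / np.linalg.norm(mean)
```

The resulting unit vector can then serve as the text condition for editing, in place of any single template's embedding.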