AI Summary
To address insufficient class diversity and significant domain shift between remote sensing (RS) and natural images in RS image segmentation, this paper proposes DGL-RSIS, a training-free framework. Methodologically, it introduces a novel global-local decoupling mechanism: (i) context-aware cropping and cross-scale Grad-CAM optimization enable vision-language cross-modal alignment; (ii) NLP techniques disentangle local nouns from global modifiers in text; (iii) unsupervised mask proposals, mask-guided feature matching, RS-specific prior integration, and pixel-to-mask activation fusion jointly refine segmentation. DGL-RSIS unifies open-vocabulary semantic segmentation and referring expression segmentation without fine-tuning. Evaluated on multiple RS benchmarks, it achieves state-of-the-art performance across both tasks, demonstrating substantially improved zero-shot transfer capability of vision-language models to the remote sensing domain.
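The text-side decoupling described in (ii) can be illustrated with a minimal sketch. This is not the paper's actual NLP pipeline (which would use proper linguistic tooling such as a POS tagger or dependency parser); a small hand-written noun list and stopword set stand in for real taggers, purely to show the idea of splitting a referring expression into a local class noun and global modifiers:

```python
# Illustrative sketch of global-local text decoupling: pull a local class
# noun out of a referring expression and keep the remaining content words
# as global context modifiers. The noun list and stopword set below are
# hypothetical stand-ins for a real NLP pipeline.

RS_CLASS_NOUNS = {"airplane", "ship", "vehicle", "building", "runway", "harbor"}
STOPWORDS = {"the", "a", "an", "of", "is", "in", "on", "at"}

def decouple_expression(expression: str):
    """Return (local_class_noun, global_modifiers) for a referring expression."""
    tokens = expression.lower().replace(",", " ").split()
    # The first known class noun is taken as the local target...
    local_noun = next((t for t in tokens if t in RS_CLASS_NOUNS), None)
    # ...and the remaining content words become global context modifiers.
    modifiers = [t for t in tokens if t != local_noun and t not in STOPWORDS]
    return local_noun, modifiers

noun, mods = decouple_expression("the large airplane on the left of the runway")
print(noun)  # "airplane"
print(mods)  # ['large', 'left', 'runway']
```

Here "airplane" would drive the local mask-classification branch, while modifiers like "large" and "left" carry the global context used by the Grad-CAM-based branch.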
Abstract
The emergence of vision-language models (VLMs) has bridged vision and language, enabling joint multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the limited category diversity in RS datasets and the domain gap between natural and RS imagery. Here, we propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs and performs vision-language alignment at both the local semantic and global contextual levels through tailored strategies. Specifically, we first introduce a global-local decoupling (GLD) module, in which text inputs are divided into local class nouns and global modifiers using natural language processing (NLP) techniques, and image inputs are partitioned into a set of class-agnostic mask proposals via unsupervised mask proposal networks. Second, visual and textual features are aligned at the local scale through a novel context-aware cropping strategy that extracts image patches with appropriate boundaries, and by introducing RS-specific knowledge to enrich the text inputs. By matching the enhanced text features with mask-guided visual features, we enable mask classification, supporting open-vocabulary semantic segmentation (OVSS). Third, at the global scale, we propose a Cross-Scale Grad-CAM module that refines Grad-CAM maps using contextual information from global modifiers. A subsequent mask selection module integrates pixel-level Grad-CAM activations into the mask-level segmentation output, such that accurate and interpretable alignment can be realized across global and local dimensions for referring expression segmentation (RES).
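The mask-guided feature matching step for OVSS can be sketched as follows. Under the assumption of a CLIP-style encoder producing a dense feature map and text embeddings (random vectors stand in for real embeddings here), each class-agnostic mask pools its visual features and is assigned to the class whose text embedding is most similar by cosine similarity:

```python
import numpy as np

# Sketch of mask-guided feature matching for open-vocabulary mask
# classification. Real embeddings would come from a vision-language
# encoder; random vectors are used here for illustration only.

rng = np.random.default_rng(0)

def mask_pooled_features(feat_map, masks):
    """Average an (H, W, D) feature map inside each binary mask -> (M, D)."""
    return np.stack([feat_map[m > 0].mean(axis=0) for m in masks])

def classify_masks(mask_feats, text_feats):
    """Assign each mask to the class with the most similar text embedding."""
    a = mask_feats / np.linalg.norm(mask_feats, axis=1, keepdims=True)
    b = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)  # cosine similarity, then argmax

H, W, D, n_classes = 8, 8, 16, 3
feat_map = rng.standard_normal((H, W, D))
masks = [np.zeros((H, W)), np.zeros((H, W))]
masks[0][:4], masks[1][4:] = 1, 1      # two toy class-agnostic proposals
text_feats = rng.standard_normal((n_classes, D))
labels = classify_masks(mask_pooled_features(feat_map, masks), text_feats)
print(labels.shape)  # one class index per mask: (2,)
```

Because both the mask proposals and the text embeddings are produced by frozen, off-the-shelf components, this matching step requires no training, which is what makes the framework training-free.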