DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation

📅 2025-08-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the limited class diversity of remote sensing (RS) datasets and the significant domain shift between RS and natural images in RS image segmentation, this paper proposes DGL-RSIS, a training-free framework. Methodologically, it introduces a novel global-local decoupling mechanism: (i) context-aware cropping and cross-scale Grad-CAM refinement enable vision-language cross-modal alignment; (ii) NLP techniques disentangle local class nouns from global modifiers in the text input; (iii) unsupervised mask proposals, mask-guided feature matching, RS-specific prior integration, and pixel-to-mask activation fusion jointly refine the segmentation. DGL-RSIS unifies open-vocabulary semantic segmentation and referring expression segmentation without fine-tuning. Evaluated on multiple RS benchmarks, it achieves state-of-the-art performance on both tasks, demonstrating substantially improved zero-shot transfer of vision-language models to the remote sensing domain.
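The pixel-to-mask activation fusion step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the toy heatmap, the binary mask proposals, and the 0.5 score threshold are all placeholder assumptions.

```python
import numpy as np

def select_masks(gradcam: np.ndarray, masks: list, threshold: float = 0.5):
    """Score each binary mask proposal by its mean Grad-CAM activation
    and keep those whose score clears a threshold (values are illustrative)."""
    scores = [float(gradcam[m].mean()) for m in masks]
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return keep, scores

# Toy 4x4 Grad-CAM heatmap: activation concentrated in the top-left corner.
cam = np.zeros((4, 4))
cam[:2, :2] = 0.9

mask_a = np.zeros((4, 4), dtype=bool); mask_a[:2, :2] = True   # overlaps the hot region
mask_b = np.zeros((4, 4), dtype=bool); mask_b[2:, 2:] = True   # cold region

keep, scores = select_masks(cam, [mask_a, mask_b])
print(keep)  # only the mask overlapping the activated region survives
```

The point of the fusion is that pixel-level Grad-CAM evidence is aggregated per mask proposal, so the final output inherits the clean boundaries of the proposals rather than the diffuse heatmap.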

πŸ“ Abstract
The emergence of vision-language models (VLMs) has bridged vision and language, enabling joint multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the limited category diversity in RS datasets and the domain gap between natural and RS imagery. Here, we propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs and performs vision-language alignment at both the local semantic and global contextual levels through tailored strategies. Specifically, we first introduce a global-local decoupling (GLD) module, in which text inputs are divided into local class nouns and global modifiers using natural language processing (NLP) techniques, and image inputs are partitioned into a set of class-agnostic mask proposals via unsupervised mask proposal networks. Second, visual and textual features are aligned at the local scale through a novel context-aware cropping strategy that extracts image patches with proper boundaries, while RS-specific knowledge enriches the text inputs. By matching the enhanced text features with mask-guided visual features, we enable mask classification, supporting open-vocabulary semantic segmentation (OVSS). Third, at the global scale, we propose a Cross-Scale Grad-CAM module that refines Grad-CAM maps using contextual information from the global modifiers. A subsequent mask selection module integrates the pixel-level Grad-CAM activations into the mask-level segmentation output, so that accurate and interpretable alignment is realized across global and local dimensions for referring expression segmentation (RES).
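The mask classification step (matching text features against mask-guided visual features) can be illustrated with a toy NumPy sketch. Everything here is a hedged stand-in for the paper's pipeline: the per-pixel feature map, the binary mask proposals, and the text embeddings would come from the VLM, whereas below they are hand-crafted 2-D vectors.

```python
import numpy as np

def classify_masks(pixel_feats: np.ndarray, masks: list, text_feats: np.ndarray):
    """Assign each mask proposal the class whose (unit-normalized) text
    embedding has the highest cosine similarity with the mask-pooled
    visual feature. A stand-in for mask-guided feature matching."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    labels = []
    for m in masks:
        v = pixel_feats[m].mean(axis=0)        # mask-guided pooling
        v = v / np.linalg.norm(v)              # unit-normalize
        labels.append(int(np.argmax(t @ v)))   # best-matching class noun
    return labels

# Toy 4x4 feature map with 2-D features and two class text embeddings.
pixel_feats = np.zeros((4, 4, 2))
pixel_feats[:2] = [1.0, 0.0]   # top half resembles class 0
pixel_feats[2:] = [0.0, 1.0]   # bottom half resembles class 1
top = np.zeros((4, 4), dtype=bool); top[:2] = True
bottom = ~top
text_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
print(classify_masks(pixel_feats, [top, bottom], text_feats))  # [0, 1]
```

Because the proposals are class-agnostic, this matching step is where the open vocabulary enters: swapping in a different set of text embeddings changes the label space without any retraining.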
Problem

Research questions and friction points this paper is trying to address.

Transferring vision-language models to remote sensing image segmentation
Overcoming limited category diversity in remote sensing datasets
Bridging domain gap between natural and remote sensing imagery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples visual and textual inputs
Aligns features at local and global scales
Uses training-free framework for segmentation
Boyi Li
School of Geographical Sciences, University of Bristol, University Road, Bristol BS8 1SS, U.K.
Ce Zhang
School of Geographical Sciences, University of Bristol, University Road, Bristol BS8 1SS, U.K.
Richard M. Timmerman
School of Geographical Sciences, University of Bristol, University Road, Bristol BS8 1SS, U.K.
Wenxuan Bao
University of Illinois Urbana-Champaign