Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address three key challenges in zero-shot pixel-level segmentation of remote sensing imagery—redundant masks from SAM, weak local object awareness in CLIP, and the absence of multi-scale aerial pretraining—this paper proposes VTPSeg, a vision-text collaborative prompting segmentation framework. Methodologically, it introduces (1) a novel dual-modal (vision-and-text) prompting joint filtering mechanism; (2) a GD+ bounding-box generation module coupled with CLIP++ semantic refinement, enabling end-to-end interpretable, low-redundancy open-vocabulary segmentation; and (3) a fusion of Grounding DINO+, CLIP Filter++, and FastSAM, incorporating multi-scale feature alignment, cross-modal prompt guidance, and bounding-box-driven mask generation. Evaluated on five mainstream remote sensing datasets, VTPSeg achieves an average mIoU improvement of 6.2%, reduces redundant masks by 73%, and significantly enhances segmentation accuracy, generalization capability, and inference efficiency.

Technology Category

Application Category

📝 Abstract
Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and Segment Anything Model(SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite their advances, challenges specific to remote sensing remain substantial. Firstly, The SAM without clear prompt constraints, often generates redundant masks, and making post-processing more complex. Secondly, the CLIP model, mainly designed for global feature alignment in foundational models, often overlooks local objects crucial to remote sensing. This oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Thirdly, both models have not been pre-trained on multi-scale aerial views, increasing the likelihood of detection failures. To tackle these challenges, we introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+(GD+) module generates initial candidate bounding boxes, while the CLIP Filter++(CLIP++) module uses a combination of visual and textual prompts to refine and filter out irrelevant object bounding boxes, ensuring that only pertinent objects are considered. Subsequently, these refined bounding boxes serve as specific prompts for the FastSAM model, which executes precise segmentation. Our VTPSeg is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.
Problem

Research questions and friction points this paper is trying to address.

Addresses redundant mask generation in SAM for remote sensing.
Improves local object recognition in CLIP for multi-target imagery.
Enhances segmentation accuracy on multi-scale aerial views.
Innovation

Methods, ideas, or system contributions that make the work stand out.

VTPSeg pipeline integrates Grounding DINO, CLIP, SAM
CLIP++ filters objects with visual, textual prompts
FastSAM performs precise segmentation using refined prompts
🔎 Similar Papers
No similar papers found.
Xing Zi
Xing Zi
Researcher, University of Technology Sydney
Computer VisionRemote SensingMultimodal
K
Kairui Jin
University of Technology Sydney, Australia
X
Xian Tao
Chinese Academy of Sciences
J
Jun Li
University of Technology Sydney, Australia
Ali Braytee
Ali Braytee
University of Technology Sydney
machine learningoptimizationdata miningcomputational biology
R
Rajiv Ratn Shah
Indraprastha Institute of Information Technology, Delhi
M
Mukesh Prasad
University of Technology Sydney, Australia