ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high manual prompting cost and poor domain adaptability of the Segment Anything Model (SAM) in remote sensing image segmentation, this paper proposes ViRefSAM—a reference-guided, class-aware segmentation framework that requires no hand-crafted point or box prompts, only a few annotated reference images. Methodologically, a Visual Contextual Prompt Encoder extracts class-specific semantic clues from the reference images and generates object-aware prompts through contextual interaction with the target image, while a Dynamic Target Alignment Adapter, inserted into SAM's image encoder, mitigates the domain shift between natural and remote sensing imagery and improves generalization to unseen categories. Crucially, ViRefSAM keeps SAM's original backbone intact and integrates seamlessly with few-shot learning paradigms. Extensive experiments on the iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$ benchmarks demonstrate consistent improvements over state-of-the-art few-shot segmentation methods, validating ViRefSAM's accuracy, minimal annotation dependency, and strong cross-domain generalization.

📝 Abstract
The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.
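As a rough, illustrative sketch of the reference-guided idea described in the abstract (not the authors' implementation), the snippet below pools reference-image features over the annotated mask into a class prototype and scores every target-image location against it. In ViRefSAM, class-specific clues of this kind would be turned into object-aware prompt embeddings for SAM's mask decoder; here, the feature layout, dimensions, and function names are all assumptions made for the example.

```python
import math

def masked_average_pool(feats, mask):
    """Pool reference features over the annotated mask into a class prototype.
    feats: H x W grid of D-dim feature vectors (nested lists); mask: H x W of 0/1."""
    d = len(feats[0][0])
    proto = [0.0] * d
    count = 0
    for row_f, row_m in zip(feats, mask):
        for vec, m in zip(row_f, row_m):
            if m:
                proto = [p + v for p, v in zip(proto, vec)]
                count += 1
    return [p / max(count, 1) for p in proto]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def prior_map(target_feats, prototype):
    """Similarity of each target location to the class prototype: a coarse,
    class-aware prior that could seed prompt embeddings for SAM's decoder."""
    return [[cosine(vec, prototype) for vec in row] for row in target_feats]

# Tiny worked example: the top row of the reference image is the masked class.
ref_feats = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
ref_mask = [[1, 1], [0, 0]]
proto = masked_average_pool(ref_feats, ref_mask)            # -> [1.0, 0.0]
prior = prior_map([[[1.0, 0.0], [0.0, 1.0]]], proto)        # high, then low
```

This masked-average-pooling-plus-similarity pattern is a common baseline in few-shot segmentation; the paper's contextual interaction between reference and target features is a richer, learned version of the same matching idea.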
Problem

Research questions and friction points this paper is trying to address.

Automating prompt-free segmentation for remote sensing images
Enhancing SAM's domain adaptability to remote sensing data
Enabling few-shot learning for unseen class segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guides SAM with a few annotated reference images instead of manual point or box prompts
Introduces a Visual Contextual Prompt Encoder that generates object-aware prompts from reference images
Employs a Dynamic Target Alignment Adapter in SAM's image encoder to bridge the natural-to-RS domain gap
Hanbo Bi
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Yulong Xu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Ya Li
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Yongqiang Mao
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Boyuan Tong
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Chongyang Li
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Chunbo Lang
School of Automation, Northwestern Polytechnical University, Xi’an 710129, China
Wenhui Diao
Aerospace Information Research Institute, Chinese Academy of Sciences
Hongqi Wang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Yingchao Feng
Aerospace Information Research Institute, Chinese Academy of Sciences
Xian Sun
Aerospace Information Research Institute, Chinese Academy of Sciences