SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of the Segment Anything Model (SAM) in comprehending natural language for referring expression segmentation (RES). To this end, the authors propose a Semantic-Spatial Prompt (SSP) encoder that, for the first time, integrates a joint semantic-spatial prompting mechanism into the SAM architecture. By leveraging visual and linguistic attention adapters to fuse multimodal information, the method naturally supports both Generalized RES (GRES) and open-vocabulary scenarios without modifying SAM’s backbone. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance across multiple RES and GRES benchmarks, remaining strong under strict metrics such as Pr@0.9 and improving on existing RES methods in the open-vocabulary setting on the PhraseCut dataset.

📝 Abstract
The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.
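The abstract describes visual and linguistic attention adapters that cross-fuse the two modalities before a prompt generator distills them into prompt embeddings for SAM. The following is a minimal toy sketch of that fusion pattern, not the authors' implementation: plain numpy cross-attention stands in for the adapters, mean pooling stands in for the prompt generator, and all shapes and names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # scaled dot-product attention: each query token attends over
    # all key/value tokens and returns their weighted average
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

# toy features (hypothetical sizes): 16 image-patch tokens and
# 5 word tokens, both in a shared 32-dim embedding space
rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 32))   # visual features
lang = rng.standard_normal((5, 32))   # linguistic features

# "linguistic attention adapter" analogue: words attend to patches,
# highlighting discriminative phrases grounded in the image
lang_enh = lang + cross_attention(lang, vis)

# "visual attention adapter" analogue: patches attend to words,
# highlighting salient objects named by the expression
vis_enh = vis + cross_attention(vis, lang)

# "prompt generator" analogue: pool the fused linguistic features
# into a single prompt embedding, playing the role of the sparse
# prompt tokens SAM's mask decoder consumes
prompt = lang_enh.mean(axis=0, keepdims=True)
print(prompt.shape)  # (1, 32)
```

The key design point mirrored here is that SAM's backbone is untouched: all language understanding happens in the adapter/fusion stage, and its output enters SAM only through the existing prompt interface.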
Problem

Research questions and friction points this paper is trying to address.

Referring Expression Segmentation
Segment Anything Model
natural language understanding
semantic-spatial prompt
image segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Spatial Prompt
Referring Expression Segmentation
Segment Anything Model
Visual-Linguistic Attention
Generalized RES