SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

📅 2025-07-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The Segment Anything Model (SAM) handles spatial prompts such as points and boxes well, but semantic text prompts remain underexplored. SAM-PTx is a parameter-efficient adaptation that keeps most of SAM frozen and adds a lightweight Parallel-Text adapter to the image encoder. The adapter modifies only the MLP-parallel branch of each transformer block, injecting frozen CLIP-derived text embeddings as class-level guidance while leaving the attention pathway untouched for spatial reasoning. Supervised experiments and ablations on COD10K and on low-data subsets of COCO and ADE20K show improved segmentation over purely spatial prompt baselines, with minimal added computational complexity. To the authors' knowledge, this is the first work to use text prompts for segmentation on COD10K.

📝 Abstract

The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.
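The abstract describes the adapter only at a high level, so the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a bottleneck adapter that sits in parallel with a frozen MLP, adds a projected text embedding to the down-projected tokens, and is zero-initialized on the up-projection. The names `ParallelTextAdapter`, `frozen_mlp`, `mlp_sublayer`, the `scale` factor, and the omission of layer normalization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class ParallelTextAdapter:
    """Hypothetical bottleneck adapter running in parallel with a frozen MLP."""

    def __init__(self, dim, text_dim, bottleneck, scale=0.1):
        # Only these three matrices would be trained in this sketch.
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_text = rng.normal(0.0, 0.02, (text_dim, bottleneck))
        # Zero init (an assumption): the adapter starts as a no-op.
        self.w_up = np.zeros((bottleneck, dim))
        self.scale = scale

    def __call__(self, tokens, text_emb):
        # tokens: (n_tokens, dim); text_emb: (text_dim,), a fixed class embedding.
        # The projected text vector is broadcast across all image tokens.
        hidden = tokens @ self.w_down + text_emb @ self.w_text
        return self.scale * (gelu(hidden) @ self.w_up)


def frozen_mlp(tokens, w1, w2):
    # Stand-in for the pretrained MLP; w1 and w2 are never updated.
    return gelu(tokens @ w1) @ w2


def mlp_sublayer(tokens, text_emb, w1, w2, adapter):
    # Residual + frozen MLP + trainable text-conditioned branch.
    # The attention sublayer is not shown because it is left unchanged.
    return tokens + frozen_mlp(tokens, w1, w2) + adapter(tokens, text_emb)
```

With the zero-initialized up-projection, the sublayer initially reproduces the frozen model's output, and the text embedding only starts to influence the tokens once the adapter weights are trained.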

Problem

Research questions and friction points this paper is trying to address.

Enhancing SAM with text prompts for semantic segmentation

Parameter-efficient adaptation using frozen CLIP text embeddings

Improving segmentation via semantics while preserving spatial reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient fine-tuning of SAM

Parallel-Text adapter for text embeddings

Semantic text prompts enhance segmentation
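To give a rough sense of what "parameter-efficient" can mean for an adapter of this kind, the sketch below counts weights for one standard transformer block and for a hypothetical bottleneck adapter. Every number here is an assumption chosen for illustration: a block width of 768, a text embedding size of 512, and a bottleneck of 64. The paper's actual sizes are not given in this summary, and the helper names are invented for the example.

```python
def linear_params(n_in, n_out, bias=True):
    # Weight matrix plus optional bias vector.
    return n_in * n_out + (n_out if bias else 0)


def frozen_block_params(dim, mlp_ratio=4):
    # Standard transformer block: fused QKV + output projection,
    # two-layer MLP, and two layer norms (scale and shift each).
    attn = linear_params(dim, 3 * dim) + linear_params(dim, dim)
    mlp = linear_params(dim, mlp_ratio * dim) + linear_params(mlp_ratio * dim, dim)
    norms = 2 * 2 * dim
    return attn + mlp + norms


def adapter_params(dim, text_dim, bottleneck):
    # Hypothetical adapter: token down-projection, bias-free text
    # projection, and up-projection back to the block width.
    down = linear_params(dim, bottleneck)
    text = linear_params(text_dim, bottleneck, bias=False)
    up = linear_params(bottleneck, dim)
    return down + text + up


# Illustrative sizes only (assumptions, not values from the paper).
dim, text_dim, bottleneck = 768, 512, 64
frozen = frozen_block_params(dim)
trainable = adapter_params(dim, text_dim, bottleneck)
print(f"frozen per block:    {frozen:,}")
print(f"trainable per block: {trainable:,}")
print(f"trainable share:     {trainable / (frozen + trainable):.2%}")
```

Under these assumed sizes, the adapter adds only a small fraction of each block's parameters, which is the general motivation for training an adapter instead of the full encoder.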
