CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

📅 2026-05-23
🤖 AI Summary
This work addresses two limitations of the Segment Anything Model (SAM): its lack of semantic awareness and its reliance on external prompts. It introduces lightweight multi-modal semantic adapters that inject CLIP-derived text, vision, and similarity features into SAM's image encoder, enabling internal semantic conditioning while preserving the original prompting interface. The design supports both joint text and spatial prompting (Manual mode) and text-only Semi-Automatic segmentation, and the authors identify prompt consistency between training and inference as an important design principle. Combined with parameter-efficient fine-tuning (PEFT), the method performs on par with or better than existing approaches on both general-domain and domain-specific tasks in low-label regimes, while remaining parameter-efficient.
📝 Abstract
Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.
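The internal conditioning described in the abstract can be illustrated with a minimal sketch. This is not the paper's architecture: the class name `SemanticAdapter`, the bottleneck layout, the zero-initialized up-projection, and the assumption that CLIP patch features are already resampled to SAM's token grid are all hypothetical choices made here for illustration. Random arrays stand in for real SAM and CLIP features. The sketch shows one plausible way to combine a text embedding, per-patch visual features, and their cosine similarity into a residual update of encoder tokens.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class SemanticAdapter:
    """Hypothetical bottleneck adapter (illustrative only, not the paper's design).

    Fuses a text embedding, per-patch visual features, and their cosine
    similarity, then adds the result to the encoder tokens as a residual.
    """

    def __init__(self, sam_dim, clip_dim, bottleneck=16, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 2 * clip_dim + 1  # text + vision + one similarity scalar
        self.w_down = rng.normal(0.0, 0.02, size=(in_dim, bottleneck))
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity mapping and leaves the frozen encoder unchanged.
        self.w_up = np.zeros((bottleneck, sam_dim))

    def __call__(self, sam_tokens, clip_patches, clip_text):
        # sam_tokens:   (N, sam_dim)  encoder tokens for N image patches
        # clip_patches: (N, clip_dim) visual features on the same patch grid
        # clip_text:    (clip_dim,)   embedding of the text prompt
        v = l2_normalize(clip_patches)
        t = l2_normalize(clip_text)
        sim = v @ t  # (N,) cosine similarity of each patch to the text
        text = np.broadcast_to(t, v.shape)
        sem = np.concatenate([text, v, sim[:, None]], axis=1)
        hidden = np.maximum(sem @ self.w_down, 0.0)  # ReLU bottleneck
        return sam_tokens + hidden @ self.w_up  # residual injection


# Usage with random stand-ins for real features.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 8))
patches = rng.normal(size=(6, 4))
text = rng.normal(size=4)
adapter = SemanticAdapter(sam_dim=8, clip_dim=4)
conditioned = adapter(tokens, patches, text)
```

Because the update is purely residual and acts on encoder tokens, the prompt encoder and mask decoder are untouched in this sketch, which matches the abstract's statement that the original promptable interface is preserved. In a parameter-efficient setup, only `w_down` and `w_up` would be trained.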
Problem

Research questions and friction points this paper is trying to address.

semantic segmentation
promptable models
vision-language alignment
foundation models
semantic conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient
semantic conditioning
promptable segmentation
vision-language integration
multi-modal adapters