🤖 AI Summary
Prompt-based segmentation models such as SAM rely on explicit spatial prompts (points, boxes, masks), so they cannot be applied directly to fully automatic camouflaged object detection (COD), where no prompts are available at inference. To address this mismatch, the authors propose IP-SAM, which adapts the model in prompt space rather than by modifying intermediate features, so that no external prompts are needed. A Self-Prompt Generator (SPG) distills image context into intrinsic prompts that act as coarse regional anchors; these are passed through SAM2's frozen prompt encoder and combined with a LoRA-finetuned image encoder and a task-specific mask decoder trained from scratch, forming an end-to-end automatic segmentation framework. A Prompt-Space Gating (PSG) mechanism uses the intrinsic background prompt to suppress background false positives before decoding. While keeping the native prompt interface intact, IP-SAM reports state-of-the-art results on four COD benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters, and a model trained solely on Kvasir-SEG shows strong zero-shot transfer to CVC-ClinicDB and ETIS for medical polyp segmentation.
📝 Abstract
Prompt-conditioned foundation segmenters have emerged as a dominant paradigm for image segmentation, where explicit spatial prompts (e.g., points, boxes, masks) guide mask decoding. However, many real-world deployments require fully automatic segmentation, creating a structural mismatch: the decoder expects prompts that are unavailable at inference. Existing adaptations typically modify intermediate features, inadvertently bypassing the model's native prompt interface and weakening prompt-conditioned decoding. We propose IP-SAM, which revisits adaptation from a prompt-space perspective through prompt-space conditioning. Specifically, a Self-Prompt Generator (SPG) distills image context into complementary intrinsic prompts that serve as coarse regional anchors. These cues are projected through SAM2's frozen prompt encoder, restoring prompt-guided decoding without external intervention. To suppress background-induced false positives, Prompt-Space Gating (PSG) leverages the intrinsic background prompt as an asymmetric suppressive constraint prior to decoding. Under a deterministic no-external-prompt protocol, IP-SAM achieves state-of-the-art performance across four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters (optimizing SPG, PSG, and a task-specific mask decoder trained from scratch, alongside image-encoder LoRA while keeping the prompt encoder frozen). Furthermore, the proposed conditioning strategy generalizes beyond COD to medical polyp segmentation, where a model trained solely on Kvasir-SEG exhibits strong zero-shot transfer to both CVC-ClinicDB and ETIS.
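For readers unfamiliar with the MAE figure quoted above, the following is a minimal sketch of how mean absolute error is commonly computed for a single image in COD evaluation: the per-pixel absolute difference between a prediction map scaled to [0, 1] and a binary ground-truth mask, averaged over all pixels. The function name `mean_absolute_error` and the toy 2×2 inputs are illustrative assumptions, not taken from the paper, and the authors' exact evaluation protocol (resizing, normalization, averaging across images) is not specified here.

```python
def mean_absolute_error(pred, gt):
    """Per-image MAE between a prediction map in [0, 1] and a binary mask.

    Both arguments are 2D lists (rows of pixel values) of identical shape.
    This is an illustrative helper, not the paper's evaluation code.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("pred and gt must have the same shape")

    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)  # per-pixel absolute error
            count += 1

    if count == 0:
        raise ValueError("inputs must contain at least one pixel")
    return total / count


# Toy example (hypothetical values): a 2x2 prediction against a 2x2 mask.
pred = [[0.9, 0.1],
        [0.2, 0.8]]
gt = [[1, 0],
      [0, 1]]

score = mean_absolute_error(pred, gt)
print(f"MAE = {score:.3f}")
```

Lower is better: a value of 0 means the prediction map matches the mask exactly. Benchmark-level scores such as the COD10K figure are typically obtained by averaging this per-image value over the test set.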