BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of effectively fusing point and text prompts—complementary yet difficult to integrate—for image segmentation. It proposes BiPrompt-SAM, which models the two prompt types as complementary experts. A lightweight, explicit cross-modal similarity gating mechanism dynamically selects the optimal mask from SAM's multi-candidate outputs, combining text-guided semantic understanding with point-based spatial precision. The core contribution is an interpretable and generalizable explicit dual-modal selection paradigm that circumvents the biases inherent in implicit fusion strategies. On EndoVis17, BiPrompt-SAM achieves 89.55% mDice; on RefCOCO, RefCOCO+, and RefCOCOg, it attains IoU scores of 87.1%, 86.5%, and 85.8%, respectively, significantly outperforming state-of-the-art methods. Notably, it excels in complex semantic reasoning, fine-grained discrimination among visually similar objects, and segmentation under occlusion.
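The explicit selection step described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: `dice_score` and `select_mask` are hypothetical helpers, and it assumes SAM's point-prompted candidate masks and the text-guided semantic mask are already available as boolean NumPy arrays of the same shape.

```python
import numpy as np

def dice_score(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity between two boolean masks."""
    intersection = np.logical_and(a, b).sum()
    return (2.0 * intersection + eps) / (a.sum() + b.sum() + eps)

def select_mask(point_candidates: list[np.ndarray], text_mask: np.ndarray):
    """Explicit gating: pick the SAM candidate most similar to the
    text-guided semantic mask, returning the mask and its score."""
    scores = [dice_score(c, text_mask) for c in point_candidates]
    best = int(np.argmax(scores))
    return point_candidates[best], scores[best]
```

In MoE terms, the point and text branches play the role of experts, and the argmax over similarity scores is the rudimentary gating network; any mask-overlap metric (Dice, IoU) could serve as the score.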

📝 Abstract
Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM's inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct "experts," and the similarity scoring serves as a rudimentary "gating network." We conducted extensive evaluations on both the EndoVis17 medical dataset and the RefCOCO series of natural image datasets. On EndoVis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
Problem

Research questions and friction points this paper is trying to address.

Combining point and text prompts for optimal segmentation performance
Enhancing segmentation via explicit selection between dual-modal prompts
Improving accuracy in complex, multi-object, and occluded scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-modal prompt fusion with explicit selection
Combines point and text prompts via similarity metrics
Simplified Mixture of Experts for optimal segmentation