CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation

Date: 2025-06-29
Citations: 0
Influential: 0
AI Summary
To address inaccurate detail localization, over-reliance on geometric prompts (e.g., points or bounding boxes), and spatial information loss in multi-organ medical image segmentation, this paper proposes CRISP-SAM2. The model introduces a cross-modal interaction mechanism and a semantic prompting strategy, replacing conventional geometric prompts with text-based semantic guidance. It incorporates progressive cross-attention, semantic prompt encoding, similarity-driven memory self-updating, and mask refinement modules to achieve deep visual-language fusion and precise local structural modeling. Evaluated on seven public benchmarks, CRISP-SAM2 significantly outperforms state-of-the-art methods, particularly in small-organ segmentation and boundary detail recovery. Results demonstrate the effectiveness and robustness of the semantic-driven segmentation paradigm.

๐Ÿ“ Abstract
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts, and loss of spatial information. To address these challenges, we introduce CRISP-SAM2, a novel model based on SAM2 with CRoss-modal Interaction and Semantic Prompting. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy that replaces the original prompt encoder and sharpens the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt the model to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, particularly in addressing the limitations mentioned above. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
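The abstract names a progressive cross-attention mechanism but does not spell out its design. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention applied in two stages: visual tokens first attend to text tokens, then text tokens attend to the updated visual tokens. The two-stage ordering, the residual additions, the absence of learned projections, and all names (`cross_attention`, `visual_tokens`, `text_tokens`) are assumptions made for this example, not details taken from CRISP-SAM2.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output column at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add_residual(x, y):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Toy features (hypothetical): 3 image patches and 2 word embeddings, dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 2.0]]

# Stage 1 (assumed): visual tokens query the text tokens.
visual_ctx = add_residual(
    visual_tokens, cross_attention(visual_tokens, text_tokens, text_tokens))

# Stage 2 (assumed): text tokens query the text-aware visual tokens.
text_ctx = add_residual(
    text_tokens, cross_attention(text_tokens, visual_ctx, visual_ctx))
```

In a real model the queries, keys, and values would pass through learned linear projections and multiple heads; the sketch omits these to keep the data flow between the two modalities visible.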
Problem

Research questions and friction points this paper is trying to address.

Inaccurate segmentation details in current multi-organ models
Dependence on geometric prompts for organ segmentation
Loss of spatial information in medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive cross-attention for cross-modal semantics
Semantic prompting replaces geometric prompts
Self-updating memory and mask-refining enhance details
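The source describes the memory strategy only as "similarity-sorting self-updating". One plausible reading, sketched below under stated assumptions, is to rank stored memory entries by their similarity to the current slice's feature and keep only the top few. The use of cosine similarity, the fixed `capacity`, the dictionary layout, and the function names are all hypothetical choices for this example rather than the paper's actual procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def update_memory(memory, new_entry, current_feature, capacity):
    """Add new_entry, then keep the `capacity` entries most similar to current_feature."""
    candidates = memory + [new_entry]
    ranked = sorted(
        candidates,
        key=lambda entry: cosine_similarity(entry["feature"], current_feature),
        reverse=True,
    )
    return ranked[:capacity]


# Hypothetical memory bank: one pooled feature vector per previously seen slice.
memory = [
    {"slice": 0, "feature": [1.0, 0.0]},
    {"slice": 1, "feature": [0.0, 1.0]},
]
new_entry = {"slice": 2, "feature": [0.9, 0.1]}

memory = update_memory(memory, new_entry, current_feature=[1.0, 0.0], capacity=2)
kept_slices = [entry["slice"] for entry in memory]
```

Here the entry for slice 1 is dropped because its feature is orthogonal to the current one, while slices 0 and 2 are retained. Ranking by similarity instead of recency is the design choice being illustrated: the bank favors slices that resemble the one currently being segmented.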