🤖 AI Summary
To address three key challenges in retinal image segmentation—ambiguous textual descriptions, SAM’s reliance on manual prompts, and the absence of a unified multimodal framework—this paper proposes CLAPS, a fully automatic, cross-task, and cross-modal retinal segmentation method. Methodologically, CLAPS introduces modality signatures to disambiguate textual prompts; couples a CLIP-pretrained image encoder with GroundingDINO to autonomously detect lesions and generate spatial prompts; and jointly uses the textual and spatial prompts to drive SAM for end-to-end segmentation. Evaluated on 12 public datasets covering 11 critical clinical tasks, CLAPS matches the performance of specialized expert models and surpasses existing methods on most metrics. Its robust generalization across diverse anatomical structures, pathologies, and imaging modalities demonstrates strong clinical applicability and establishes a new benchmark for automated, multimodal retinal analysis.
📝 Abstract
Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique "modality signature" for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.
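The abstract describes a three-stage automated prompt flow: a modality-signature text prompt, GroundingDINO-style box detection, and SAM-guided segmentation. The sketch below illustrates that control flow only; the function names, the signature tokens, and the placeholder detector/segmenter are all illustrative assumptions, not the authors' implementation or the real GroundingDINO/SAM APIs.

```python
# Illustrative sketch of the CLAPS pipeline shape described in the abstract.
# Every name here is a hypothetical stand-in, not the paper's actual code.

MODALITY_SIGNATURES = {  # assumed one signature token per imaging modality
    "fundus": "[FUN]",
    "oct": "[OCT]",
    "ffa": "[FFA]",
}

def build_text_prompt(target: str, modality: str) -> str:
    """Prepend the modality signature to resolve textual ambiguity."""
    return f"{MODALITY_SIGNATURES[modality]} {target}"

def detect_boxes(image, text_prompt):
    """Stand-in for GroundingDINO: return (x0, y0, x1, y1) lesion boxes."""
    return [(32, 48, 96, 112)]  # fixed placeholder box for illustration

def segment(image, boxes, text_prompt):
    """Stand-in for SAM: produce one mask record per spatial prompt."""
    return [{"box": b, "prompt": text_prompt, "mask": None} for b in boxes]

def claps_pipeline(image, target: str, modality: str):
    """Fully automatic flow: text prompt -> boxes -> masks, no manual input."""
    text_prompt = build_text_prompt(target, modality)
    boxes = detect_boxes(image, text_prompt)
    return segment(image, boxes, text_prompt)
```

The key design point the sketch captures is that the spatial prompts are derived from the text prompt by a detector, so the user never draws a box or clicks a point, which is what makes the pipeline end-to-end automatic.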