π€ AI Summary
Current promptable medical image segmentation models lack systematic evaluation across varying prompt quality levels. Method: We conduct the first zero-shot benchmark on the BraTS 2023 multimodal MRI dataset, evaluating SAM, SAM 2, MedSAM, SAM-Med-3D, and nnU-Net. We compare point prompts against high-precision bounding box prompts and introduce pediatric tumor data for fine-tuning to enhance point-prompt performance. Contribution/Results: Under bounding box prompts, SAM and SAM 2 achieve Dice scores of 0.894 and 0.893βsurpassing nnU-Net. Fine-tuning on pediatric oncology data significantly improves point-prompt accuracy. Our analysis demonstrates that prompt quality critically governs model performance, affirming the viability of general-purpose vision foundation models in medical segmentation. This work provides empirical evidence and methodological guidance for prompt engineering in clinical imaging analysis.
π Abstract
Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2 achieving scores of up to 0.894 and 0.893, respectively when given extremely accurate bounding box prompts which exceeds nnU-Net's segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.