🤖 AI Summary
This study addresses two obstacles to the clinical deployment of promptable foundation models in medical image segmentation: the lack of standardized evaluation and their unclear sensitivity to real human prompts. For the first time, human-generated prompts are used to systematically evaluate eleven models on multi-site musculoskeletal CT segmentation tasks—specifically bone and implant segmentation in the wrist, shoulder, hip, and lower leg—using non-iterative 2D/3D prompting strategies. Pareto front analysis identifies the strongest models, which are then probed with prompts collected in observer studies to assess robustness and inter-annotator consistency. Results reveal that all models are highly sensitive to human prompts, exhibiting substantially degraded performance compared to ideal prompts and poor cross-annotator consistency. Among them, SAM and SAM2.1 achieve the best 2D performance, while nnInteractive and Med-SAM2 lead in 3D, offering empirical guidance for model selection in clinically relevant human-in-the-loop scenarios.
📝 Abstract
Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The growing number of models, together with evaluations that differ in datasets, metrics, and compared baselines, makes direct performance comparison difficult and complicates the selection of the most suitable model for a specific clinical task. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on one private and one public dataset, focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip, and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) Segmentation performance varies substantially between FMs and prompting strategies; 2) The Pareto-optimal models are SAM and SAM2.1 in 2D, and nnInteractive and Med-SAM2 in 3D; 3) Localization accuracy and rater consistency vary with anatomical structure, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) Segmentation performance drops with human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels may overestimate performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, this did not extend to inter-rater settings. We conclude that selecting the optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
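The abstract identifies Pareto-optimal models across evaluation criteria. The sketch below shows a minimal non-dominated filter over per-model scores; the model names match the study, but the score axes (ideal-prompt vs. human-prompt Dice) and all numbers are purely illustrative, not results from the paper:

```python
def pareto_front(scores):
    """Return indices of non-dominated entries (higher is better on every axis)."""
    front = []
    for i, s in enumerate(scores):
        dominated = any(
            all(o[k] >= s[k] for k in range(len(s)))
            and any(o[k] > s[k] for k in range(len(s)))
            for j, o in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical per-model scores: (Dice with "ideal" prompts, Dice with human prompts).
models = ["SAM", "SAM2.1", "nnInteractive", "Med-SAM2", "MedSAM"]
scores = [(0.82, 0.70), (0.88, 0.64), (0.84, 0.74), (0.86, 0.71), (0.75, 0.60)]

best = [models[i] for i in pareto_front(scores)]
print(best)  # models that no other model beats on both axes simultaneously
```

A model stays on the front if no competitor matches or beats it on every axis while strictly beating it on at least one; with multiple conflicting criteria this typically yields a set of trade-off candidates rather than a single winner, which is why the study reports different Pareto-optimal models for 2D and 3D.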