Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of standardized evaluation and unclear sensitivity to real human prompts that hinder the clinical deployment of promptable foundation models in medical image segmentation. For the first time, we introduce human-generated prompts to systematically evaluate eleven models on multi-site musculoskeletal CT segmentation tasks—specifically bone and implant segmentation in the wrist, shoulder, hip, and lower leg—using non-iterative 2D/3D prompting strategies and observer studies, with robustness and inter-annotator consistency assessed via Pareto front analysis. Results reveal that all models are highly sensitive to human prompts, exhibiting substantially degraded performance compared to ideal prompts and poor cross-annotator consistency. Among them, SAM and SAM2.1 achieve the best 2D performance, while nnInteractive and Med-SAM2 lead in 3D, offering empirical guidance for model selection in clinically relevant human-in-the-loop scenarios.
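The repository linked in the abstract contains the authors' prompt-extraction code; as a rough sketch of one common way to derive an "ideal" point prompt from a reference label (an assumption for illustration, not necessarily the paper's exact method), one can take the most interior foreground pixel via a Euclidean distance transform:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def ideal_point_prompt(mask: np.ndarray) -> tuple[int, int]:
    """Pick the most interior pixel of a binary reference mask.

    The distance transform assigns each foreground pixel its distance
    to the nearest background pixel; the argmax is a point deep inside
    the structure, a common choice for a simulated 'ideal' click.
    """
    if not mask.any():
        raise ValueError("empty mask: no foreground to prompt")
    dist = distance_transform_edt(mask)
    row, col = np.unravel_index(np.argmax(dist), mask.shape)
    return int(row), int(col)

# toy 2D 'bone' mask: a filled square in a 9x9 image
mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True
print(ideal_point_prompt(mask))  # → (4, 4), the square's center
```

Human raters, by contrast, click wherever they judge best, which is exactly the gap this study quantifies.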

📝 Abstract
Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
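The Pareto-optimal selection mentioned in the abstract can be sketched as follows: a model is Pareto-optimal if no other model is at least as good on every metric and strictly better on at least one. The metric names and scores below are made up for illustration and are not results from the paper.

```python
def pareto_front(scores: dict[str, tuple[float, ...]]) -> list[str]:
    """Return the models not dominated on any metric (higher is better).

    Model b dominates model a if b scores >= a on every metric and
    strictly > a on at least one.
    """
    def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
        return all(y >= x for x, y in zip(a, b)) and any(
            y > x for x, y in zip(a, b)
        )

    return [
        m for m, s in scores.items()
        if not any(dominates(s, t) for n, t in scores.items() if n != m)
    ]

# hypothetical (Dice, inter-rater consistency) pairs -- illustrative only
scores = {
    "SAM":    (0.82, 0.70),
    "SAM2.1": (0.80, 0.75),
    "MedSAM": (0.78, 0.65),  # dominated by SAM on both metrics
}
print(pareto_front(scores))  # → ['SAM', 'SAM2.1']
```

SAM and SAM2.1 each win on one axis, so neither dominates the other and both sit on the front; MedSAM is strictly worse than SAM on both axes and drops out.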
Problem

Research questions and friction points this paper is trying to address.

foundation models
medical image segmentation
prompt sensitivity
human-in-the-loop
musculoskeletal CT
Innovation

Methods, ideas, or system contributions that make the work stand out.

promptable foundation models
human-in-the-loop prompting
medical image segmentation
model sensitivity
musculoskeletal CT
Caroline Magg
Quantitative Healthcare Analysis (QurAI) Group, University of Amsterdam, Science Park 900, Amsterdam, 1098 XH, The Netherlands
Maaike A. ter Wee
Department Biomedical Engineering and Physics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands
Johannes G. G. Dobbe
Department Biomedical Engineering and Physics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands
Geert J. Streekstra
Department Biomedical Engineering and Physics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands
Leendert Blankevoort
Department Orthopaedics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands
Clara I. Sánchez
Full Professor, Informatics Institute, University of Amsterdam
Computer-Aided Diagnosis · Medical Image Analysis
Hoel Kervadec
Universiteit van Amsterdam
Computer vision · Medical image analysis · Weak supervision