🤖 AI Summary
Interactive medical image segmentation lacks a unified, clinically credible evaluation standard, which distorts algorithm comparisons and misrepresents real-world performance. This paper proposes a clinical-need-driven, standardized evaluation framework that defines reproducible task paradigms and metrics. For the first time, it systematically identifies the critical roles of information preservation, adaptive scaling, and training-validation prompt consistency in model robustness. The framework enables cross-domain comparative evaluation of both 2D and 3D models on multimodal data, slab-like structures, and irregular targets, while explicitly modeling user interaction behavior. Experiments show that 3D contextual modeling significantly improves segmentation accuracy for large and irregularly shaped lesions, whereas models pre-trained outside the medical domain degrade sharply under low contrast and complex morphologies. This work establishes the first clinically grounded, fair benchmark for interactive segmentation evaluation.
📝 Abstract
Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and build a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour or budgets differ from those used in training, (iv) 2D methods perform well on slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
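The core of such an evaluation pipeline is an interaction loop: the model predicts, a simulated user issues a corrective prompt under a fixed budget, and a metric is recorded after every interaction. A minimal sketch of that loop is below; the `model_predict` callable, the random-error click-sampling strategy, and the Dice metric are illustrative assumptions, not the paper's actual framework or prompt simulator.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def sample_click(pred, gt, rng):
    """Simulate a corrective user click at a random error voxel.
    Returns (coordinate, is_positive) or None if the masks agree."""
    errors = np.argwhere(pred != gt)
    if len(errors) == 0:
        return None
    coord = tuple(errors[rng.integers(len(errors))])
    return coord, bool(gt[coord])  # positive click if the voxel is foreground

def evaluate(model_predict, image, gt, click_budget=5, seed=0):
    """One interactive episode: predict, simulate a click, re-predict,
    and record the metric after every interaction (fixed click budget)."""
    rng = np.random.default_rng(seed)
    clicks, scores = [], []
    pred = model_predict(image, clicks)
    scores.append(dice(pred, gt))
    for _ in range(click_budget):
        click = sample_click(pred, gt, rng)
        if click is None:
            break  # prediction already matches the reference
        clicks.append(click)
        pred = model_predict(image, clicks)
        scores.append(dice(pred, gt))
    return scores

# Toy demo: a stand-in "model" that thresholds the image and ignores clicks.
rng = np.random.default_rng(42)
gt = np.zeros((16, 16, 16), dtype=bool)
gt[4:12, 4:12, 4:12] = True
image = gt.astype(float) + 0.1 * rng.standard_normal(gt.shape)
scores = evaluate(lambda img, clicks: img > 0.5, image, gt)
```

Recording the full per-interaction score curve, rather than only the final score, is what lets a benchmark compare convergence speed across models and detect the train/validation prompt-budget mismatches noted in the findings.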