🤖 AI Summary
In biomedical image segmentation validation, metrics such as the Hausdorff distance suffer from implementation inconsistencies across open-source toolkits, compromising benchmark reliability, introducing biomarker bias, and posing clinical deployment risks. To address this, we systematically evaluate 11 widely used toolkits and introduce, for the first time, a reference implementation based on high-fidelity 3D surface meshes. Our framework integrates real-world clinical data and a cross-platform consistency analysis. Statistical analysis reveals significant inter-tool variation in Hausdorff distance computations (p < 0.001), with interpolation strategy, boundary handling, and sampling density identified as primary sources of discrepancy. Based on these findings, we propose a reproducible and verifiable paradigm for distance-based evaluation, accompanied by standardized computational guidelines. This work substantially enhances the reliability, comparability, and clinical translatability of segmentation assessment.
📝 Abstract
The evaluation of segmentation performance is a common task in biomedical image analysis, and its importance is emphasized in recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance. These metrics are usually computed with publicly available open-source tools, under the inherent assumption that the tools provide consistent results. In this study, we questioned this assumption and performed a systematic implementation analysis, along with quantitative experiments on real-world clinical data, to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, since it calls into question the validity of existing studies. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
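To make the two metric families named in the abstract concrete, here is a minimal Python sketch of a counting metric (Dice) and a distance-based metric (symmetric Hausdorff distance) on small binary masks. It is an illustration only, not the paper's mesh-based reference implementation or any of the 11 evaluated tools. The function names `dice` and `hausdorff`, the `spacing` parameter, and the choice to treat every foreground pixel centre as a point are assumptions made for this example.

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient of two binary masks (counting metric)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between two binary masks.

    Simplification (assumption): every foreground pixel centre is used as a
    point, scaled by the physical pixel spacing; no boundary extraction or
    surface meshing is performed.
    """
    pa = np.argwhere(np.asarray(a, dtype=bool)) * np.asarray(spacing)
    pb = np.argwhere(np.asarray(b, dtype=bool)) * np.asarray(spacing)
    if len(pa) == 0 or len(pb) == 0:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    # Brute-force pairwise Euclidean distances, shape (len(pa), len(pb)).
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    # Directed distances a -> b and b -> a; the metric is their maximum.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Two toy 5x5 masks: `b` extends `a` by one column.
a = np.zeros((5, 5), dtype=bool)
b = np.zeros((5, 5), dtype=bool)
a[1:3, 1:3] = True
b[1:3, 1:4] = True

print(dice(a, b))                           # 0.8
print(hausdorff(a, b))                      # 1.0 (unit spacing)
print(hausdorff(a, b, spacing=(1.0, 2.0)))  # 2.0 (anisotropic spacing)
```

Even in this toy case, the Hausdorff distance depends on choices the Dice coefficient never sees, such as how pixel spacing is applied and which points represent each segmentation. That gives a sense of how implementations of the same distance-based metric can report different values.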