🤖 AI Summary
A lack of standardized, open-source benchmarks for multi-organ segmentation in multiphase T1-weighted (T1w) abdominal MRI hinders fair algorithm evaluation and clinical translation.
Method: We constructed the first standardized, open benchmark dataset comprising 40 T1w volumes curated from the public Duke Liver Dataset (10 volumes each from the pre-contrast, arterial, venous, and delayed phases), with expert manual annotations for ten abdominal structures. We systematically evaluated three state-of-the-art deep learning-based segmentation tools (MRSegmentator, TotalSegmentator MRI, and TotalVibeSegmentator) using the Dice similarity coefficient (DSC) and Hausdorff distance (HD), with statistical significance assessed via ANOVA and Tukey's HSD tests.
Contribution/Results: MRSegmentator achieved superior overall performance across all phases (mean DSC: 80.7±18.6%; mean HD: 8.9±10.4 mm; p<0.05). This work establishes the first reproducible, open benchmark specifically designed for multiphase T1w abdominal MRI segmentation. The dataset, annotation protocol, and evaluation framework are publicly released to support algorithm development, validation, and clinical deployment.
📝 Abstract
The segmentation of multiple organs in multi-parametric MRI studies is critical for many applications in radiology, such as correlating imaging biomarkers with disease status (e.g., cirrhosis, diabetes). Recently, three publicly available tools, namely MRSegmentator (MRSeg), TotalSegmentator MRI (TS), and TotalVibeSegmentator (VIBE), have been proposed for multi-organ segmentation in MRI. However, the performance of these tools on specific MRI sequence types has not yet been quantified. In this work, a subset of 40 volumes from the public Duke Liver Dataset was curated. The curated dataset contained 10 volumes each from the pre-contrast fat-saturated T1w, arterial T1w, venous T1w, and delayed T1w phases. Ten abdominal structures were manually annotated in these volumes. Next, the performance of the three public tools was benchmarked on this curated dataset. The results indicated that MRSeg obtained a Dice score of 80.7 $\pm$ 18.6% and a Hausdorff Distance (HD) error of 8.9 $\pm$ 10.4 mm. It performed best ($p<.05$) across the different sequence types compared with TS and VIBE.
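The two reported metrics can be illustrated with a minimal sketch for a single binary organ mask. This is not the authors' evaluation code: the function names `dice_score` and `hausdorff_mm`, the voxel spacing, and the toy masks are assumptions made for illustration, and the Hausdorff distance here is the plain set-based maximum over all foreground voxels, which may differ from the variant used in the paper (for example, a surface-based or percentile version).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def dice_score(pred, ref):
    """Dice similarity coefficient between two binary masks (range 0 to 1)."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Both masks empty: treat as perfect agreement (a convention, not a rule).
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def hausdorff_mm(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance in mm over all foreground voxels."""
    # Convert voxel indices to physical coordinates using the voxel spacing.
    p = np.argwhere(np.asarray(pred, dtype=bool)) * np.asarray(spacing)
    r = np.argwhere(np.asarray(ref, dtype=bool)) * np.asarray(spacing)
    if len(p) == 0 or len(r) == 0:
        return float("inf")  # undefined when one mask is empty
    return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])


# Toy example (synthetic, not study data): a 4x4x4 cube and the same cube
# shifted by one voxel along the first axis.
ref = np.zeros((8, 8, 8), dtype=bool)
ref[2:6, 2:6, 2:6] = True
pred = np.zeros((8, 8, 8), dtype=bool)
pred[3:7, 2:6, 2:6] = True

dice = dice_score(pred, ref)
hd = hausdorff_mm(pred, ref, spacing=(2.0, 1.0, 1.0))  # hypothetical spacing in mm
print(f"Dice = {dice:.2f}, HD = {hd:.1f} mm")
```

In a multi-organ setting, the same two functions would be applied once per label (for example, `pred == label` against `ref == label`) and then summarized as mean and standard deviation across organs and volumes.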