🤖 AI Summary
Medical AI segmentation algorithms suffer from poor real-world generalizability, oversimplified evaluation protocols, and severe distributional shift. Method: We introduce Touchstone, a large-scale, multicenter abdominal organ CT segmentation benchmark comprising 5,195 training cases from 76 hospitals and 5,903 diverse test cases from 11 unseen institutions. Using an out-of-distribution (OOD), third-party, blinded evaluation paradigm, we independently assess 19 state-of-the-art algorithms as well as pre-existing AI frameworks, including MONAI (NVIDIA) and nnU-Net (DKFZ). A standardized evaluation framework unifies the Dice score, 95th-percentile Hausdorff distance (HD95), and inference efficiency, alongside open-source preprocessing code, evaluation APIs, and a sustainable assessment protocol. Results: Most advanced models degrade substantially under OOD conditions, while nnU-Net-based methods generalize best. This benchmark provides a statistically robust and clinically representative baseline for abdominal organ segmentation.
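To make the unified metrics concrete, here is a minimal NumPy/SciPy sketch of the Dice score and HD95 on binary 3D masks. This is not the benchmark's official evaluation API; the voxel-spacing handling and the edge-case conventions (e.g., empty masks) are our own assumptions.

```python
import numpy as np
from scipy import ndimage

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|A∩B| / (|A| + |B|) on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th-percentile symmetric Hausdorff distance (in mm if `spacing` is in mm)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    # Boundary voxels: the mask minus its binary erosion.
    pred_border = pred ^ ndimage.binary_erosion(pred)
    gt_border = gt ^ ndimage.binary_erosion(gt)
    if not pred_border.any() or not gt_border.any():
        return float("inf")  # assumed convention for empty prediction/reference
    # Distance from every voxel to the nearest boundary voxel of the other mask,
    # honoring anisotropic voxel spacing via `sampling`.
    dist_to_gt = ndimage.distance_transform_edt(~gt_border, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~pred_border, sampling=spacing)
    surface_dists = np.hstack([dist_to_gt[pred_border], dist_to_pred[gt_border]])
    return float(np.percentile(surface_dists, 95))
```

Taking the 95th percentile rather than the maximum is what makes HD95 robust to a few outlier boundary voxels, which is why it is preferred over the plain Hausdorff distance for clinical segmentation.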
📝 Abstract
How can we test AI performance? The question sounds trivial, but it is not. Standard benchmarks often suffer from small, in-distribution test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, strong performance on a standard benchmark does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative benchmark for segmenting 9 types of abdominal organs. The benchmark is built on 5,195 training CT scans from 76 hospitals around the world and 5,903 test CT scans from 11 additional hospitals. This diverse test set strengthens the statistical significance of the results and rigorously evaluates AI algorithms across a range of out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated them on three test sets. In addition, we evaluated pre-existing AI frameworks, which, unlike individual algorithms, are more flexible and can support many different algorithms; these include MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage further innovation in AI algorithms for the medical domain.
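To illustrate how a third-party, per-organ evaluation over a held-out test set might be organized, here is a hedged sketch. The NIfTI file layout, the nine-organ label convention (labels 1–9), and the helper names `evaluate_case` and `evaluate_test_set` are hypothetical; `dice_score` refers to the metric sketch above.

```python
import numpy as np
import nibabel as nib  # CT segmentation masks are commonly stored as NIfTI

ORGAN_LABELS = range(1, 10)  # assumed convention: integer labels 1-9, one per organ

def evaluate_case(pred_path: str, gt_path: str) -> dict[int, float]:
    """Per-organ Dice for one case, using dice_score() from the sketch above."""
    pred = np.asarray(nib.load(pred_path).dataobj)
    gt = np.asarray(nib.load(gt_path).dataobj)
    return {k: dice_score(pred == k, gt == k) for k in ORGAN_LABELS}

def evaluate_test_set(cases: list[tuple[str, str]]) -> dict[int, float]:
    """Mean per-organ Dice over one held-out test set."""
    results = [evaluate_case(pred, gt) for pred, gt in cases]
    return {k: float(np.mean([r[k] for r in results])) for k in ORGAN_LABELS}

# Usage sketch: score each OOD test set separately, rather than pooling them,
# so that performance drops at individual unseen hospitals stay visible.
# scores = {name: evaluate_test_set(cases) for name, cases in test_sets.items()}
```

Keeping the three test sets separate in the report, as the benchmark does, is what exposes out-of-distribution degradation that a single pooled score would average away.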