Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Problem: Medical AI segmentation algorithms generalize poorly to real-world data, and existing benchmarks rely on oversimplified evaluation protocols that do not capture severe distribution shifts. Method: The paper introduces Touchstone, a large-scale, multicenter benchmark for abdominal organ CT segmentation, comprising 5,195 training cases from 76 hospitals and 5,903 diverse test cases from 11 unseen institutions. Evaluation is out-of-distribution (OOD), third-party, and blinded: the benchmark team independently assesses 19 algorithms and also evaluates existing frameworks such as MONAI and nnU-Net. A standardized evaluation framework combines Dice score, 95th-percentile Hausdorff distance (HD95), and inference efficiency, and is accompanied by open-source preprocessing, evaluation APIs, and a sustainable assessment protocol. Results: Most advanced models show substantial performance degradation under OOD conditions; nnU-Net demonstrates superior generalization. The benchmark aims to provide a statistically robust, clinically representative baseline for abdominal organ segmentation.
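As a rough illustration of the Dice score named in the summary, the sketch below computes per-organ Dice on flattened label maps. It is a minimal pure-Python example, not the benchmark's evaluation API: the function names, the `LABELS` mapping, and the toy voxel lists are all hypothetical, and scoring an organ that is absent from both masks as 1.0 is an assumption made here for simplicity.

```python
from typing import Dict, Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for one organ label over flattened voxels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of voxels")
    pred_count = sum(1 for v in pred if v == label)
    truth_count = sum(1 for v in truth if v == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_count + truth_count == 0:
        # Assumption: an organ absent from both masks counts as a perfect match.
        return 1.0
    return 2.0 * overlap / (pred_count + truth_count)


def per_organ_dice(
    pred: Sequence[int], truth: Sequence[int], labels: Dict[str, int]
) -> Dict[str, float]:
    """Compute Dice separately for every organ in the label mapping."""
    return {name: dice_score(pred, truth, idx) for name, idx in labels.items()}


# Hypothetical label mapping and toy data (0 is background).
LABELS = {"liver": 1, "spleen": 2}
truth = [0, 1, 1, 1, 2, 2, 0, 0]
pred = [0, 1, 1, 0, 2, 2, 2, 0]

scores = per_organ_dice(pred, truth, LABELS)
for organ, score in scores.items():
    print(f"{organ}: {score:.3f}")
```

A real evaluation would operate on 3D volumes and would also need a surface-distance metric such as HD95, which depends on voxel spacing and is not sketched here.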

📝 Abstract
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
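To make the point about test-set size concrete, the sketch below shows how the half-width of a normal-approximation 95% confidence interval for a mean score shrinks as the number of test cases grows. Only the figure of 5,903 test scans comes from the abstract; the standard deviation of 0.10, the comparison size of 100 cases, and the function name are assumptions chosen for illustration.

```python
import math


def ci_half_width(std: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a mean.

    std: per-case standard deviation of the metric (assumed value)
    n:   number of independent test cases
    z:   critical value (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * std / math.sqrt(n)


ASSUMED_STD = 0.10  # hypothetical per-case spread of Dice scores

small = ci_half_width(ASSUMED_STD, 100)   # a hypothetical small test set
large = ci_half_width(ASSUMED_STD, 5903)  # test-set size reported in the abstract

print(f"n=100:  +/- {small:.4f}")
print(f"n=5903: +/- {large:.4f}")
```

Under these assumptions the interval narrows in proportion to the square root of the sample size, so smaller differences between algorithms become distinguishable. The approximation treats cases as independent, which scans from the same hospital may not be.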
Problem

Research questions and friction points this paper is trying to address.

Medical Image Segmentation
AI Algorithm Evaluation
Real-world Application
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Image Segmentation
AI Algorithm Performance
Diverse CT Image Dataset
Authors

Pedro R. A. S. Bassi
Department of Computer Science, Johns Hopkins University; Department of Pharmacy and Biotechnology, University of Bologna; Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia

Wenxuan Li
Johns Hopkins University
Imaging Informatics, Computer-aided Diagnosis

Yucheng Tang
Sr. Research Scientist at NVIDIA
3D Computer Vision, Vision-Language Model, Healthcare AI, Accelerated Computing

Fabian Isensee
HIP Applied Computer Vision Lab, Division of Medical Image Computing, German Cancer Research Center
Computer Vision, Deep Learning, Segmentation, Medical Image Computing

Zifu Wang
Shanghai AI Laboratory
Large Language Models

Jieneng Chen
Johns Hopkins University
computer vision, world models, health, robotics

Yu-Cheng Chou
Johns Hopkins University
MLLM, Reinforcement Learning, Computer Vision

Saikat Roy
Doctoral Researcher, German Cancer Research Center (DKFZ)
Deep Learning, Image Segmentation, Representation Learning, Diffusion Models, Medical Image Analysis

Yannick Kirchhoff
PhD Student, DKFZ
Computer Vision, Deep Learning, Medical Image Computing

Maximilian R. Rokuss
Division of Medical Image Computing, German Cancer Research Center (DKFZ); Helmholtz Imaging, German Cancer Research Center (DKFZ)

Ziyan Huang
Peking University

Jin Ye
Peking University

Junjun He
Shanghai Jiao Tong University

Tassilo Wald
PhD Student, Deutsche Krebsforschungszentrum (DKFZ)
representation learning, self-supervised learning, medical image analysis

Constantin Ulrich
German Cancer Research Center (DKFZ)
Medical Image Computing, Medical physics, Computer Vision

Michael Baumgartner
Division of Medical Image Computing, German Cancer Research Center (DKFZ); Helmholtz Imaging, German Cancer Research Center (DKFZ)

Klaus H. Maier-Hein
Professor, Medical Image Computing, German Cancer Research Center
Medical Image Analysis, Machine Learning

Paul Jaeger
Helmholtz Imaging, German Cancer Research Center (DKFZ)

Yiwen Ye
University of Electronic Science and Technology of China

Yutong Xie
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Jianpeng Zhang
Huazhong University of Science and Technology

Ziyang Chen
Peking University
Quantum key distribution, Quantum random number generation

Yong Xia
University of Electronic Science and Technology of China

Zhaohu Xing
Hong Kong University of Science and Technology (Guangzhou)
Medical Image Analysis, Video Understanding, Image Generation

Lei Zhu
Georgia Institute of Technology

Yousef Sadegheih
University of Tehran

Afshin Bozorgpour
Sharif University of Technology
Deep Learning, Computer Vision, Image Processing

Pratibha Kumari
University of Regensburg
Continual learning, Anomaly detection, Adaptive learning, Concept drift, Surveillance

Reza Azad
University of Bonn

Dorit Merhof
Professor, Faculty of Informatics and Computer Science, University of Regensburg

Pengcheng Shi
University of Texas at San Antonio

Ting Ma
Harbin Institute of Technology (Shenzhen)
Computational neuroscience, neuroimage, brain-computer-interface, medical image analysis

Yuxin Du
Peking University

Fan Bai
Peking University

Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing

Bo Zhao
Peking University

Haonan Wang
Georgia Institute of Technology

Xiaomeng Li
Assistant Professor, The Hong Kong University of Science and Technology
Medical Image Analysis, AI in Healthcare, Deep Learning

Hanxue Gu
Duke University
Medical imaging, Deep learning, Machine learning

Haoyu Dong
Harbin Institute of Technology

Jichen Yang
National University of Singapore
anti-spoofing and speaker recognition

Maciej A. Mazurowski
Associate Professor of Biostatistics & Bioinformatics, Radiology, Comp. Sci., ECE, Duke University
Machine Learning, Artificial Intelligence, Medical Imaging

Saumya Gupta
Duke University

Linshan Wu
Georgia Institute of Technology

Jiaxin Zhuang
PhD in CSE, HKUST
Computer Vision, Medical Image Analysis, Artificial Intelligence

Haoyang Chen
Tsinghua University

Holger Roth
NVIDIA

Daguang Xu
Senior Research Manager at NVIDIA
Deep Learning, Machine Learning, Medical Image Analysis, Compressive Sensing, Sparse coding

Matthew B. Blaschko
NVIDIA

Sergio Decherchi
Facility Coordinator, Fondazione Istituto Italiano di Tecnologia
machine learning, high performance computing, computational chemistry, applied math

Andrea Cavalli
Director, CECAM-EPFL - Professor, University of Bologna
Molecular Dynamics, Computational Chemistry, Drug Discovery, Cancer, Alzheimer's disease

Alan L. Yuille
Department of Computer Science, Johns Hopkins University

Zongwei Zhou
Department of Computer Science, Johns Hopkins University