🤖 AI Summary
This study addresses the lack of systematic evaluation of AI for liver fibrosis staging in real-world, multicenter, heterogeneous clinical settings. To this end, we constructed LiFS, to our knowledge the first large-scale multicenter dataset pairing complete gadoxetic acid-enhanced multiphase MRI sequences with histopathological reference standards, and used the MICCAI 2025 CARE-Liver Challenge to benchmark nine AI approaches. The benchmarked methods differed in multi-series registration, multimodal fusion, backbone architecture, and input dimensionality. The best of them were broadly comparable to the senior radiologist and significantly outperformed the junior radiologist in selected settings. Our findings identify inter-center heterogeneity, label imbalance, and variability in contrast-enhanced sequences as the key challenges, and provide benchmarks and insights for future clinical deployment of AI in liver fibrosis assessment.
📝 Abstract
Despite years of methodological progress, AI for liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver Challenge, comprising multi-sequence MRI from 610 patients across multiple centers and scanners. To the best of our knowledge, LiFS is the first benchmark to provide complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. We systematically evaluate 9 independently developed methods, selected from 96 registered teams, against in-cohort radiologist reference results, and assess how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached the junior radiologist's level. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice closes the gap on its own. Overall, LiFS provides a rigorous real-world benchmark for assessing the current state of AI in liver fibrosis staging and for enabling research on the key challenges that limit clinically reliable deployment.