🤖 AI Summary
This study addresses the challenge of generalizing fully automated lesion segmentation to unseen tracer-center combinations in multi-tracer, multi-center PET/CT data through the autoPET3 challenge. We curated a large-scale annotated dataset covering two tracers (FDG and PSMA) and two medical centers, and systematically evaluated algorithmic generalization on a test set that includes previously unobserved tracer-center combinations. Our analysis indicates that data heterogeneity and case difficulty affect performance more than the choice of algorithm among top-ranked teams. The top-ranked algorithm, a 3D multimodal nnU-Net model integrating PET and CT, achieved a mean Dice Similarity Coefficient (DSC) of 0.66, false-negative volume (FNV) of 3.18 mL, and false-positive volume (FPV) of 2.78 mL across the four test conditions, an 8% DSC improvement and a 5 mL FNV reduction over the provided baseline. In-domain performance is probably approaching reader agreement, while generalization to unseen combinations remains an open problem. We also release the largest publicly available annotated PSMA PET/CT dataset to date.
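To make the three reported metrics concrete, the following is a minimal sketch of how DSC, FNV, and FPV could be computed for one case. It assumes the component-wise definitions commonly used for lesion segmentation: FNV is the volume of ground-truth connected components that the prediction misses entirely, and FPV is the volume of predicted connected components that touch no ground-truth voxel. The function names, the `voxel_ml` parameter, the full-neighbourhood connectivity, and the toy masks are illustrative assumptions, not the challenge's evaluation code.

```python
import numpy as np
from scipy import ndimage


def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return float("nan")  # undefined when both masks are empty
    return 2.0 * np.logical_and(pred, gt).sum() / denom


def _unmatched_volume(source, other, voxel_ml):
    """Volume (mL) of connected components of `source` sharing no voxel with `other`."""
    # Assumption: full connectivity (26 neighbours in 3D); the official choice may differ.
    structure = ndimage.generate_binary_structure(source.ndim, source.ndim)
    labels, n_components = ndimage.label(source, structure=structure)
    voxels = 0
    for comp in range(1, n_components + 1):
        component = labels == comp
        if not np.logical_and(component, other).any():
            voxels += int(component.sum())
    return voxels * voxel_ml


def false_negative_volume(pred, gt, voxel_ml):
    """Volume of ground-truth lesions that the prediction misses completely."""
    return _unmatched_volume(gt.astype(bool), pred.astype(bool), voxel_ml)


def false_positive_volume(pred, gt, voxel_ml):
    """Volume of predicted components that overlap no ground-truth lesion."""
    return _unmatched_volume(pred.astype(bool), gt.astype(bool), voxel_ml)


# Toy case (hypothetical data): two true lesions, one partly found, one missed,
# plus one spurious predicted blob.
gt = np.zeros((6, 6, 6), dtype=bool)
gt[0:2, 0:2, 0:2] = True      # lesion A, 8 voxels
gt[4:6, 4:6, 4:6] = True      # lesion B, 8 voxels (missed by the prediction)
pred = np.zeros_like(gt)
pred[0:2, 0:2, 0:1] = True    # overlaps lesion A, 4 voxels
pred[0:1, 4:6, 0:1] = True    # spurious blob, 2 voxels
voxel_ml = 0.5                # assumed voxel volume in mL

dsc = dice(pred, gt)
fnv = false_negative_volume(pred, gt, voxel_ml)
fpv = false_positive_volume(pred, gt, voxel_ml)
```

Under these definitions a lesion that is only partly segmented lowers DSC but adds nothing to FNV, which is why the three numbers are reported together.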
📝 Abstract
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. For the top tier, the ranking was stable under bootstrap resampling and alternative ranking schemes. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multi-tracer PET/CT segmentation performance is adequate and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem, with errors mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
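The ranking-stability check mentioned above can be illustrated with a small bootstrap sketch: test cases are resampled with replacement, each algorithm's mean score is recomputed, and the share of resamples in which each algorithm ranks first is recorded. The ranking rule (mean DSC only), the function name, and the per-case scores are hypothetical; the abstract does not specify the challenge's ranking scheme, so this shows the general technique only.

```python
import random
from statistics import mean


def bootstrap_first_place_rate(scores, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each algorithm has the best mean score.

    `scores` maps an algorithm name to its per-case scores (same case order for all).
    """
    rng = random.Random(seed)
    n_cases = len(next(iter(scores.values())))
    wins = {name: 0 for name in scores}
    for _ in range(n_boot):
        # Resample case indices with replacement; reuse them for every algorithm
        # so that all algorithms are compared on the same resampled cases.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        means = {name: mean(vals[i] for i in idx) for name, vals in scores.items()}
        best = max(means, key=means.get)
        wins[best] += 1
    return {name: count / n_boot for name, count in wins.items()}


# Hypothetical per-case DSC values for three algorithms on five test cases.
toy_scores = {
    "algo_a": [0.70, 0.65, 0.80, 0.55, 0.60],
    "algo_b": [0.60, 0.55, 0.70, 0.45, 0.50],  # worse than algo_a on every case
    "algo_c": [0.75, 0.50, 0.85, 0.40, 0.65],  # sometimes better, sometimes worse
}
rates = bootstrap_first_place_rate(toy_scores, n_boot=500, seed=42)
```

A first-place rate close to 1 for one algorithm suggests a stable winner, while rates split across several algorithms suggest that the ordering depends on which cases happen to be in the test set.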