An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limitations of existing deep learning approaches for PET/CT imaging, which are typically task-specific, often trained on single-center data, or built on dual-branch architectures that delay cross-modal interaction and so underuse the early spatial correspondence between PET and CT. The authors present an open-source, multi-center foundation model for whole-body FDG PET/CT, built from 4,997 harmonized scans drawn from four public datasets. The architecture uses hierarchical UNet-shaped backbones with early channel-wise concatenation, so anatomical and metabolic features interact from the first embedding layer onward. They also introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss; this design avoids the non-physical intensity discontinuities at masked-region boundaries that learnable mask tokens produce. On downstream AutoPET lesion segmentation, the models reach performance comparable to models trained from scratch on the full dataset while using only 10% of the labeled training data. Under 5-shot linear probing, joint PET/CT pretraining achieves higher Dice scores than separated-modality pretraining, which points to a reduced reliance on manual annotation.
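To make the early-fusion idea concrete, here is a minimal NumPy sketch of channel-wise concatenation followed by a toy patch embedding. It is not the authors' implementation: the function names `early_fuse` and `patch_embed`, the patch size, and the embedding width are all hypothetical, and the sketch assumes the CT and PET volumes have already been resampled onto the same voxel grid.

```python
import numpy as np


def early_fuse(ct, pet):
    """Stack co-registered CT and PET volumes as channels: (D, H, W) -> (2, D, H, W)."""
    if ct.shape != pet.shape:
        raise ValueError("CT and PET must share the same voxel grid")
    return np.stack([ct, pet], axis=0)


def patch_embed(x, weight, patch=4):
    """Toy non-overlapping patch embedding for a (C, D, H, W) volume.

    `weight` has shape (C * patch**3, embed_dim). Each spatial axis is
    assumed to be divisible by `patch` (an assumption of this sketch).
    """
    c, d, h, w = x.shape
    # Split every spatial axis into (number of blocks, voxels per block).
    x = x.reshape(c, d // patch, patch, h // patch, patch, w // patch, patch)
    # Bring the block indices to the front; keep channel and in-patch offsets together.
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)
    tokens = x.reshape(-1, c * patch**3)
    # One linear projection sees CT and PET voxels of the same patch jointly.
    return tokens @ weight


rng = np.random.default_rng(0)
ct = rng.normal(size=(8, 8, 8))    # stand-in for a normalized CT crop
pet = rng.normal(size=(8, 8, 8))   # stand-in for a normalized PET crop

fused = early_fuse(ct, pet)
weight = rng.normal(size=(2 * 4**3, 16))
tokens = patch_embed(fused, weight)
print(fused.shape, tokens.shape)
```

Because both modalities enter the same projection, every token already mixes anatomical and metabolic intensities. A dual-branch design would instead embed each modality separately and merge the features at a later stage.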

📝 Abstract

The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.
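The masking objective can be illustrated with a short NumPy sketch. The abstract does not give the exact formulation, so the following is one plausible reading under stated assumptions: the volume is standardized to zero mean, masked patches are filled with 0 (the mean) instead of a learnable token, and the loss covers every voxel with a larger weight on masked ones. The helper names and the weights `w_masked` and `w_visible` are hypothetical.

```python
import numpy as np


def random_patch_mask(shape, patch, ratio, rng):
    """Boolean voxel mask (True = masked) built from non-overlapping cubic patches.

    Each axis of `shape` is assumed to be divisible by `patch`.
    """
    grid = tuple(s // patch for s in shape)
    n_patches = int(np.prod(grid))
    n_masked = int(round(ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    mask = flat.reshape(grid)
    # Expand the patch-level mask back to voxel resolution.
    for axis in range(len(shape)):
        mask = np.repeat(mask, patch, axis=axis)
    return mask


def zero_mean_impute(volume, mask):
    """Standardize the volume, then fill masked voxels with 0, i.e. the mean."""
    normed = (volume - volume.mean()) / (volume.std() + 1e-8)
    return np.where(mask, 0.0, normed), normed


def weighted_global_mse(pred, target, mask, w_masked=1.0, w_visible=0.1):
    """MSE over ALL voxels, with hypothetical weights favoring masked regions."""
    weights = np.where(mask, w_masked, w_visible)
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))


rng = np.random.default_rng(0)
volume = rng.normal(loc=3.0, scale=2.0, size=(8, 8, 8))  # stand-in for a PET crop

mask = random_patch_mask(volume.shape, patch=4, ratio=0.75, rng=rng)
masked_input, target = zero_mean_impute(volume, mask)

# Loss of a trivial "model" that returns its input unchanged.
baseline_loss = weighted_global_mse(masked_input, target, mask)
print(mask.mean(), baseline_loss)
```

Filling with the mean keeps masked regions inside the normalized intensity range. Including visible voxels in the loss also penalizes errors on both sides of a mask boundary, which is one way such a design could discourage discontinuities there.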

Problem

Research questions and friction points this paper is trying to address.

PET/CT

tumor segmentation

cross-modality

foundation model

multi-center

Innovation

Methods, ideas, or system contributions that make the work stand out.

early cross-modal fusion

masked autoencoding

zero-mean imputation