🤖 AI Summary
High-quality annotated data for tumor CT image analysis is scarce because of privacy concerns and high annotation costs. To address this, the work introduces PASTA, a pan-tumor CT foundation model. Methodologically: (1) we propose PASTA-Gen, the first synthetic tumor generation framework to produce 30,000 high-fidelity CT volumes with pixel-level lesion annotations and paired structured radiology reports; (2) we design a vision-language joint pretraining paradigm that integrates cross-modal transfer learning and few-shot fine-tuning. In terms of results, PASTA supports 46 diverse clinical tasks in a single model, including segmentation, detection, staging, survival prediction, and report generation. It achieves state-of-the-art performance on 45 of them and significantly outperforms the second-best model on 35. It also attains substantial performance gains with only a small amount of real-world data. Both the PASTA model and the synthetic dataset are openly released.
📝 Abstract
Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks (including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modality transfer learning) and significantly outperforms the second-best models on 35 of them. This advance is driven by PASTA-Gen, a synthetic tumor generation framework that produces a dataset of 30,000 CT scans with pixel-level annotated lesions and paired structured reports, encompassing malignancies across ten organs and five benign lesion types. By leveraging this high-quality synthetic data, we overcome a longstanding bottleneck in the development of CT foundation models: the scarcity of publicly available, high-quality annotated datasets, which stems from privacy constraints and the substantial labor required to scale precise annotation. PASTA is also data-efficient, markedly improving performance on various tasks with only a small amount of real-world data, which suggests practical value. The open release of both the synthetic dataset and the PASTA foundation model helps address data scarcity, thereby advancing oncological research and clinical translation.