A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

📅 2025-08-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high diagnostic uncertainty, and the resulting overtreatment, in the non-invasive evaluation of renal masses, this study develops RenalCLIP, a vision-language foundation model for renal tumor analysis that covers imaging analysis, diagnostic decision-making, and prognostic prediction. The model is trained with a disease-centric two-stage pretraining paradigm: domain knowledge is first used to enhance the visual and textual encoders separately, and contrastive learning then aligns the two modalities. The resulting system supports both domain-adaptive fine-tuning and zero-shot diagnosis. Evaluated across ten clinical tasks, it outperforms state-of-the-art general-purpose CT foundation models, reaching a C-index of 0.726 for recurrence-free survival prediction on the TCIA cohort and, in diagnostic classification, matching the peak performance of fully fine-tuned baselines with only 20% of the training data. These results suggest the model could improve diagnostic accuracy and support more precise, individualized patient management.
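The cross-modal alignment step mentioned above can be illustrated with a generic CLIP-style symmetric contrastive loss. This is a minimal NumPy sketch, not the paper's implementation: the function names, the temperature of 0.07, and the assumption that row `i` of the image embeddings is paired with row `i` of the text embeddings are all illustrative choices that do not come from the source.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Assumption: image_emb[i] and text_emb[i] come from the same study
    (e.g. a CT scan and its report); every other pairing is a negative.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = img @ txt.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same objective along the columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    emb = np.eye(4)
    print("aligned :", clip_contrastive_loss(emb, emb))
    print("shuffled:", clip_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Correctly paired embeddings should yield a much lower loss than shuffled ones, which is the signal that pulls matching scans and reports together in the shared embedding space.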

📝 Abstract

The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a vision-language foundation model for the characterization, diagnosis and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge and then aligns them through a contrastive learning objective, creating robust representations for superior generalization and diagnostic precision. Compared with other state-of-the-art general-purpose CT foundation models, RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction. Notably, for a complex task such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP's pre-training imparted remarkable data efficiency: in the diagnostic classification task, it needed only 20% of the training data to match the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
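For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It illustrates the metric only and is not the paper's evaluation code: the function name is hypothetical, a higher risk score is assumed to mean earlier expected recurrence, and pairs with tied times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    times       : follow-up time per patient
    events      : 1 if recurrence was observed, 0 if censored
    risk_scores : model output; higher is assumed to mean higher risk

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. It is concordant when
    the model assigned patient i the higher risk; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))
```

A value of 1.0 means the model ranks every comparable pair correctly, 0.5 corresponds to random ranking, and values below 0.5 indicate systematically inverted rankings.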

Problem

Research questions and friction points this paper is trying to address.

Non-invasive assessment of renal masses to reduce diagnostic uncertainty

Developing a vision-language model for kidney cancer diagnosis and prognosis

Improving generalization and data efficiency in renal mass characterization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language foundation model for kidney cancer

Two-stage pre-training with contrastive learning

Superior generalization across 10 clinical tasks
