🤖 AI Summary
This study benchmarks the transferability of three open-source medical foundation models on a six-class CT renal lesion classification task under data-scarce conditions. To address the generalization challenges that arise from limited training data, the authors use a frozen feature-probing protocol: the foundation models are not fine-tuned, and classifiers are trained on their extracted embeddings. Models are trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Performance is benchmarked against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. The foundation models achieve AUCs between 0.70 and 0.77, comparable to the ResNet-50 (AUC 0.72) at a much lower hardware cost, while radiomics significantly outperforms all deep learning approaches with an AUC of 0.88 (all p ≤ 0.002). The authors take this as a sign that current generalist foundation model embeddings do not yet capture the fine-grained texture and shape heterogeneity that distinguishes renal lesion subtypes.
📝 Abstract
The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Better generalizability would be especially valuable in CT-based renal lesion classification, a field constrained by inherently limited training data. We addressed this question by benchmarking three medical FMs on this task. The six-class problem spans common entities such as cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. Second, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p $\leq$ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state of the art.
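To make the frozen feature-probing idea concrete, the following is a minimal sketch of how such a probe could look once embeddings have been extracted. It is not the authors' pipeline: the abstract does not name the classifier trained on the embeddings or the way AUC is aggregated across the six classes, so the logistic-regression probe, the macro-averaged one-vs-rest AUC, the embedding dimension, and the synthetic data standing in for real FM embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

N_CLASSES = 6    # six lesion classes, as in the abstract
EMBED_DIM = 32   # hypothetical embedding size; real FM embeddings are larger


def make_synthetic_embeddings(n_samples, rng, class_centers):
    """Stand-in for embeddings from a frozen FM (assumption: synthetic data)."""
    # Cycle through the labels so that every class is present, then shuffle.
    labels = np.arange(n_samples) % N_CLASSES
    rng.shuffle(labels)
    noise = rng.normal(scale=1.0, size=(n_samples, EMBED_DIM))
    return class_centers[labels] + noise, labels


def fit_linear_probe(train_x, train_y):
    """Train only a light classifier; the feature extractor is never updated."""
    probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    probe.fit(train_x, train_y)
    return probe


def macro_ovr_auc(probe, test_x, test_y):
    """Macro-averaged one-vs-rest AUC (assumed aggregation, not from the paper)."""
    probabilities = probe.predict_proba(test_x)
    return roc_auc_score(test_y, probabilities, multi_class="ovr", average="macro")


rng = np.random.default_rng(0)
centers = rng.normal(scale=1.0, size=(N_CLASSES, EMBED_DIM))
train_x, train_y = make_synthetic_embeddings(600, rng, centers)
test_x, test_y = make_synthetic_embeddings(120, rng, centers)

probe = fit_linear_probe(train_x, train_y)
auc = macro_ovr_auc(probe, test_x, test_y)
print(f"macro one-vs-rest AUC on synthetic data: {auc:.3f}")
```

Because only the small classifier is fitted, this step runs in seconds on a CPU, which is consistent with the hardware savings reported above. The AUC printed here reflects the synthetic data only and says nothing about performance on real lesions.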