AI Summary
This study addresses the lack of systematic evaluation of how well pathology foundation models (PFMs) generalize across datasets in breast cancer survival prediction. It establishes the first large-scale, externally validated benchmark of PFMs for this task, comparing multiple generations of models across three independent clinical cohorts (>5,400 patients) with a unified pipeline for patch-level feature extraction and survival modeling. Models are trained on one cohort and evaluated on the other two. Second-generation models consistently outperform their first-generation counterparts, and H-optimus-1 achieves the best overall performance. Notably, the distilled model H0-mini slightly outperforms its teacher, H-optimus-0, with fewer than 8% of the parameters and significantly faster feature extraction, making it an efficient alternative. The modest absolute differences among recent models suggest diminishing returns from further scaling of pretraining data or model size alone.
Abstract
Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
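The evaluation protocol described above (patch-level features aggregated per slide, a model fit on one cohort, then scored on external cohorts) can be illustrated with a minimal sketch. The abstract does not name the aggregation method or the evaluation metric, so the mean pooling and Harrell's concordance index used here are assumptions chosen for illustration, and the helper names `mean_pool`, `concordance_index`, and `evaluate_external` are hypothetical rather than taken from the study.

```python
from typing import Callable, Dict, List, Sequence


def mean_pool(patch_features: Sequence[Sequence[float]]) -> List[float]:
    """Average patch-level feature vectors into one slide-level vector.

    Assumption: the study's actual aggregation step is not specified.
    """
    n = len(patch_features)
    dim = len(patch_features[0])
    return [sum(p[d] for p in patch_features) / n for d in range(dim)]


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
) -> float:
    """Harrell's C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i has an observed event
    and a shorter follow-up time than patient j. The pair is concordant
    when patient i also has the higher predicted risk; tied risks
    count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def evaluate_external(
    risk_fn: Callable[[Sequence[float]], float],
    cohorts: Dict[str, dict],
) -> Dict[str, float]:
    """Score a risk function, fit elsewhere, on each external cohort."""
    return {
        name: concordance_index(
            c["time"], c["event"], [risk_fn(x) for x in c["features"]]
        )
        for name, c in cohorts.items()
    }
```

Here `risk_fn` stands in for any survival model trained on the development cohort. Because it is never refit on the external cohorts, the per-cohort scores reflect cross-dataset generalization rather than in-distribution fit.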