PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Clinical translation of pathology foundation models is hindered by variation in the best-performing model across cancer types, the risk of data leakage during evaluation, and the absence of standardized benchmarks. To address these challenges, we introduce PathBench, the first comprehensive benchmark for precision oncology, spanning the full clinical workflow from diagnosis to prognosis. It comprises 15,888 private whole-slide images (WSIs) from 8,549 patients across 10 hospitals and over 64 diagnosis and prognosis tasks, with strict isolation between pretraining and evaluation data. We propose a standardized, multi-cancer, end-to-end, leakage-resistant evaluation framework that integrates an automated leaderboard and a multi-task assessment pipeline. Systematic evaluation of 19 state-of-the-art models identifies Virchow2 and H-Optimus-1 as the most effective models overall. PathBench provides a reproducible, clinically grounded evaluation platform for model development and for objective, evidence-based model selection for clinical deployment.

📝 Abstract

The emergence of pathology foundation models (PFMs) has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges, including variability in the optimal model across cancer types, potential data leakage in evaluation, and a lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through multi-center in-house datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data come from private medical providers, with strict exclusion of any pretraining usage to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.

Problem

Research questions and friction points this paper is trying to address.

Evaluating pathology foundation models' performance across diverse cancer types

Addressing data leakage risks in model evaluation and benchmarking

Standardizing benchmarks for clinical diagnosis and prognosis tasks
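
The leakage concern listed above can be illustrated with a minimal sketch, assuming every slide carries a unique identifier and that the list of slides used for pretraining is known. The function names and slide identifiers below are hypothetical and are not taken from PathBench, which the abstract describes only as excluding its private evaluation data from any pretraining use.

```python
def find_leaked_slides(pretraining_ids, evaluation_ids):
    """Return evaluation slide IDs that also appear in the pretraining set."""
    return sorted(set(pretraining_ids) & set(evaluation_ids))


def assert_isolated(pretraining_ids, evaluation_ids):
    """Raise ValueError if any evaluation slide was also used for pretraining."""
    leaked = find_leaked_slides(pretraining_ids, evaluation_ids)
    if leaked:
        raise ValueError(
            f"{len(leaked)} evaluation slide(s) overlap with pretraining data: {leaked}"
        )


# Toy identifiers for illustration only (hypothetical, not real data).
pretraining = ["slide_001", "slide_002", "slide_003"]
evaluation = ["slide_003", "slide_104", "slide_105"]

leaked = find_leaked_slides(pretraining, evaluation)
print(leaked)
```

In practice such a check would likely be done at the patient or institution level as well, since two different slides from the same patient can still leak information.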

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-center in-house datasets, strictly excluded from any pretraining use to prevent data leakage

Automated leaderboard for continuous assessment

Large-scale data (15,888 WSIs from 8,549 patients across 10 hospitals) enables objective comparison of pathology foundation models
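
One way a multi-task leaderboard could combine per-task results is by averaging each model's rank across tasks. The sketch below is an assumption for illustration: the source does not say how its leaderboard aggregates scores, and the model names, task names, and score values are toy placeholders rather than reported results.

```python
from collections import defaultdict


def mean_rank_leaderboard(scores):
    """Rank models by their mean per-task rank (lower is better).

    scores: {task_name: {model_name: score}}, where a higher score is better.
    Ties within a task are not handled specially; they are broken by
    insertion order because Python's sort is stable.
    """
    ranks = defaultdict(list)
    for per_model in scores.values():
        ordered = sorted(per_model, key=per_model.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    board = [(model, sum(r) / len(r)) for model, r in ranks.items()]
    return sorted(board, key=lambda item: item[1])


# Toy scores for illustration only (hypothetical, not results from the paper).
toy_scores = {
    "task_1": {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80},
    "task_2": {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
    "task_3": {"model_a": 0.88, "model_b": 0.80, "model_c": 0.82},
}

for model, mean_rank in mean_rank_leaderboard(toy_scores):
    print(f"{model}: mean rank {mean_rank:.2f}")
```

Rank averaging is used here because it stays comparable when tasks report different metrics; averaging raw scores would be an alternative when all tasks share one metric.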

🔎 Similar Papers

Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation