🤖 AI Summary
This work addresses the critical barriers to deploying clinical-grade AI biomarker models in computational pathology, namely the absence of standardized intermediate representations, provenance tracking, and reproducible evaluation frameworks. To overcome these challenges, the authors establish a shared benchmark framework for computational pathology built on The Cancer Genome Atlas (TCGA) cohort, providing structured intermediate representations, predefined data splits, trained models, and evaluation metrics, with reciprocal cross-site evaluation on an independent Memorial Sloan Kettering Cancer Center (MSKCC) cohort. The framework uses pathology foundation models (PFMs) to extract features from H&E whole-slide images, combined with multiple instance learning, quality-control metadata, spatial coordinate mapping, and OncoKB annotations. Across 33 tumor–biomarker tasks, a high-performing subset of eight tasks achieved mean AUROCs of 0.831 on TCGA and 0.801 on MSKCC, demonstrating cross-institutional stability and establishing a reproducible, comparable foundation for AI-driven biomarker development.
📝 Abstract
Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin and eosin (H&E) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology lacks the standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment.
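The MIL baseline described above treats a slide as a bag of tile embeddings produced by a PFM and learns a slide-level label without tile-level annotations. A minimal sketch of attention-based pooling, one common MIL aggregator, is shown below; the function and weight names are illustrative, and the abstract does not specify the exact architecture used.

```python
import numpy as np

def attention_mil_pool(tile_embeddings, w_attn, w_clf):
    """Aggregate per-tile PFM embeddings into one slide-level probability.

    tile_embeddings: (n_tiles, d) array of tile features from a PFM.
    w_attn: (d,) attention scoring weights (illustrative, normally learned).
    w_clf: (d,) classifier weights applied to the pooled slide embedding.
    """
    # Score each tile, then softmax-normalize so the slide representation
    # is a weighted average of its tiles (the MIL "attention" step).
    scores = tile_embeddings @ w_attn
    scores -= scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    slide_embedding = weights @ tile_embeddings
    # Slide-level logit -> probability via the logistic function.
    logit = slide_embedding @ w_clf
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
tiles = rng.normal(size=(128, 16))             # 128 tiles, 16-dim embeddings
prob = attention_mil_pool(tiles, rng.normal(size=16), rng.normal(size=16))
```

In practice the attention and classifier weights are trained jointly on patient-level splits; with zero attention weights the pooling reduces to a plain mean over tiles.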
We introduce GOLDMARK (https://artificialintelligencepathology.org), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing.
Across 33 tumor–biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic–genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability.
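The per-task AUROC used in these results is the probability that a randomly chosen positive slide is scored above a randomly chosen negative one; the reported means macro-average this over tasks. A self-contained sketch (the toy labels and scores below are made up for illustration):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs where the positive outranks the negative,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Per-task AUROCs are then macro-averaged across tumor-biomarker tasks.
task_aurocs = [auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
               auroc([0, 1, 1, 0], [0.2, 0.9, 0.7, 0.1])]
mean_auroc = sum(task_aurocs) / len(task_aurocs)
```

Macro-averaging weights each task equally regardless of cohort size, which is why a few hard tasks can pull the 33-task mean well below the best-performing subset.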
GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.