A protocol for evaluating robustness to H&E staining variation in computational pathology models

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Computational pathology models often exhibit unstable performance under varying H&E staining conditions across laboratories, and there is a lack of systematic methods to evaluate their robustness. This work proposes the first standardized protocol for assessing staining robustness, comprising three steps: defining a reference staining condition, quantifying staining characteristics of test sets, and evaluating model performance under simulated reference conditions. Leveraging the PLISM dataset to construct a reference staining library, the study integrates staining attribute quantification, staining simulation, and multiple instance learning—using UNI2-h, H-Optimus-1, and Virchow2 feature extractors—to evaluate 306 MSI classification models on 738 colorectal cancer cases. Results show AUCs ranging from 0.769 to 0.911 and robustness scores between 0.007 and 0.079, revealing a weak negative correlation between performance and robustness, thereby supporting robustness-informed model selection and deployment validation.

📝 Abstract
Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models, as hematoxylin and eosin (H&E) staining varies across laboratories, requiring systematic assessment of how this variability affects model predictions. In this work, we developed a three-step protocol for evaluating robustness to H&E staining variation in CPath models. Step 1: select reference staining conditions; Step 2: characterize test set staining properties; Step 3: apply CPath model(s) under simulated reference staining conditions. We first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), comprising 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high H&E intensity, low/high H&E color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769 to 0.911 ($\Delta$ = 0.142). Robustness scores ranged from 0.007 to 0.079 ($\Delta$ = 0.072) and showed a weak inverse correlation with classification performance (Pearson r = -0.22, 95% CI [-0.34, -0.11]). Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across H&E staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at https://github.com/CTPLab/staining-robustness-evaluation.
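The robustness score described in the abstract is the min-max AUC range across the four simulated staining conditions. The sketch below illustrates that computation; the AUC values and condition names are hypothetical placeholders, not results from the paper, and the actual implementation in the linked repository may differ.

```python
# Hedged sketch of the robustness score: max - min AUC across the
# four simulated staining conditions. Illustrative numbers only.

def robustness_score(aucs):
    """Robustness = min-max range of AUCs across staining conditions.

    A lower value indicates more stable performance under staining
    variation (the paper reports scores between 0.007 and 0.079).
    """
    aucs = list(aucs)
    return max(aucs) - min(aucs)

# Hypothetical AUCs for one model under the four conditions named in
# the abstract (low/high H&E intensity, low/high H&E color similarity).
aucs = {
    "low_intensity": 0.871,
    "high_intensity": 0.855,
    "low_color_similarity": 0.842,
    "high_color_similarity": 0.866,
}

print(round(robustness_score(aucs.values()), 3))  # 0.029
```

Under the protocol, this score would be computed per model and used alongside raw AUC when ranking candidate models for deployment.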
Problem

Research questions and friction points this paper is trying to address.

computational pathology
H&E staining variation
model robustness
staining sensitivity
MSI classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

staining robustness
computational pathology
H&E variation
model evaluation protocol
microsatellite instability