AI Summary
This work addresses the lack of systematic evaluation of explanation methods in multiple instance learning (MIL) models widely used in computational pathology, which commonly rely on attention heatmaps for interpretability. The authors propose a general, annotation-free framework to comprehensively benchmark various explanation techniques, including Layer-wise Relevance Propagation (LRP), Integrated Gradients, and Single Perturbation, across classification, regression, and survival analysis tasks under diverse architectures such as Attention, Transformer, and Mamba. Their large-scale evaluation reveals that both model architecture and task type significantly influence explanation quality, with LRP, Integrated Gradients, and Single Perturbation consistently outperforming conventional attention heatmaps. Furthermore, the high-performing heatmaps are correlated with spatial transcriptomics data to validate their biological relevance and uncover divergent decision strategies among models in predicting HPV infection status.
Abstract
Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where large numbers of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) we provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) we showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal
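To illustrate the kind of perturbation-based heatmap the abstract refers to ("Single"), the sketch below scores each patch in a bag by how much the slide-level prediction changes when that patch is removed. This is a minimal, hypothetical illustration: the attention-pooling scorer, its random weights, and all function names here are assumptions for demonstration, not the authors' xMIL implementation.

```python
import numpy as np

# Toy attention-pooling MIL scorer with random (hypothetical) weights.
rng = np.random.default_rng(0)
W_attn = rng.normal(size=(16, 1))  # attention scoring weights
W_head = rng.normal(size=(16, 1))  # slide-level prediction head

def mil_score(bag):
    """bag: (n_patches, 16) patch embeddings -> scalar slide-level score."""
    logits = bag @ W_attn                 # per-patch attention logits
    w = np.exp(logits - logits.max())
    w = w / w.sum()                       # softmax over patches
    z = (w * bag).sum(axis=0)             # attention-weighted pooling
    return float(z @ W_head)

def single_perturbation(bag):
    """Patch importance = score change when that single patch is dropped."""
    base = mil_score(bag)
    return [base - mil_score(np.delete(bag, i, axis=0))
            for i in range(bag.shape[0])]

bag = rng.normal(size=(8, 16))            # a bag of 8 synthetic patches
importance = single_perturbation(bag)     # one importance value per patch
```

Mapping each patch's importance value back to its slide coordinates yields a heatmap; unlike raw attention weights, these scores directly measure each patch's effect on the model output, which is the property the benchmark evaluates.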