Evaluating the Explainability of Vision Transformers in Medical Imaging

📅 2025-10-13

🤖 AI Summary

Problem: Vision Transformers (ViTs) are difficult to interpret in medical imaging, which hinders clinical trust and adoption. Method: The study evaluates how architecture and pretraining strategy (ViT, DeiT, DINO, Swin Transformer) affect explanation quality on peripheral blood cell and breast ultrasound classification, comparing Gradient Attention Rollout with Grad-CAM both quantitatively and qualitatively. Contribution/Results: DINO-pretrained models paired with Grad-CAM yield the most faithful and localized heatmaps across both datasets. The heatmaps are class-discriminative and spatially precise, and even for misclassified samples they highlight clinically relevant morphological features. Gradient Attention Rollout, by contrast, produces more scattered activations.

📝 Abstract

Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies (ViT, DeiT, DINO, and Swin Transformer) using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
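To make the Grad-CAM side of the comparison concrete, here is a minimal NumPy sketch of how a Grad-CAM heatmap could be computed over ViT patch tokens. It is an illustration under stated assumptions, not the paper's implementation: the function name `grad_cam_tokens`, the choice of layer, and the input layout (patch tokens only, CLS token already removed) are all hypothetical.

```python
import numpy as np


def grad_cam_tokens(activations, gradients, grid_hw):
    """Grad-CAM heatmap over ViT patch tokens (illustrative sketch).

    activations: (num_patches, channels) token features from one layer.
    gradients:   (num_patches, channels) gradient of the target class
                 score with respect to those features.
    grid_hw:     (height, width) of the patch grid.
    Assumption: the CLS token has already been dropped from both arrays.
    """
    h, w = grid_hw
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must have the same shape")
    if activations.shape[0] != h * w:
        raise ValueError("number of patches must equal height * width")

    # Channel weights: gradients averaged over all spatial positions.
    weights = gradients.mean(axis=0)                 # (channels,)
    # Weighted sum of channels per patch, keeping positive evidence only.
    cam = np.maximum(activations @ weights, 0.0)     # (num_patches,)
    cam = cam.reshape(h, w)

    # Scale to [0, 1] so heatmaps are comparable across images.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

In practice the two input arrays would be captured with forward and backward hooks in a deep learning framework. Because the weights come from gradients of a single class score, the resulting map is class-specific, which is consistent with the class-discriminative behaviour the abstract reports for Grad-CAM.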

Problem

Research questions and friction points this paper is trying to address.

Evaluating Vision Transformers' explainability in medical imaging tasks

Comparing explainability methods for different ViT architectures and strategies

Improving model transparency for reliable clinical diagnostic integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated ViT explainability using Gradient Attention Rollout

Applied Grad-CAM for class-discriminative heatmap generation

DINO with Grad-CAM provided the most faithful and localized explanations across both medical datasets
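For comparison with Grad-CAM, the following NumPy sketch shows one common variant of gradient-weighted attention rollout. The summary does not specify the exact formulation used in the paper, so the details here are assumptions: the function name `gradient_attention_rollout` is hypothetical, negative gradient-weighted attention is clipped before averaging over heads, and token 0 is taken to be the CLS token.

```python
import numpy as np


def gradient_attention_rollout(attentions, gradients):
    """Gradient-weighted attention rollout (illustrative sketch).

    attentions: list of (heads, tokens, tokens) attention maps, one per layer.
    gradients:  list of matching gradients of the target class score with
                respect to each attention map.
    Assumption: token 0 is the CLS token.
    Returns a (tokens - 1,) relevance vector over patch tokens in [0, 1].
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)

    for attn, grad in zip(attentions, gradients):
        # Weight attention by its gradient, keep positive contributions,
        # then fuse heads by averaging.
        weighted = np.maximum(attn * grad, 0.0).mean(axis=0)
        # Add the identity to account for residual connections.
        weighted = weighted + np.eye(num_tokens)
        # Re-normalize each row so it sums to one.
        weighted = weighted / weighted.sum(axis=-1, keepdims=True)
        # Propagate relevance through this layer.
        rollout = weighted @ rollout

    # Relevance flowing from the CLS token to each patch token.
    cls_to_patches = rollout[0, 1:]
    peak = cls_to_patches.max()
    return cls_to_patches / peak if peak > 0 else cls_to_patches
```

Because relevance is multiplied through every layer, attention mass can spread across many tokens. This is one plausible reason such maps look more scattered than Grad-CAM heatmaps, though the source itself does not state the cause.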
