🤖 AI Summary
This study addresses a gap in how explainability is evaluated for medical image classification models: existing evaluations emphasize localization accuracy and overlook whether a model applies a consistent spatial reasoning strategy across pathologically similar samples. To bridge this gap, we introduce C-Score (Consistency Score), an annotation-free metric that quantifies intra-class consistency of Class Activation Map (CAM) explanations using confidence-weighted, intensity-emphasized pairwise soft Intersection-over-Union. In experiments on the Kermany chest X-ray dataset that span transfer learning and fine-tuning phases, we pair six CAM variants (including GradCAM) with DenseNet201, InceptionV3, and ResNet50V2 and uncover three distinct mechanisms by which explanation consistency decouples from AUC. Notably, C-Score detects ScoreCAM degradation on ResNet50V2 one checkpoint before the AUC collapses, offering an early, explanation-quality-based warning signal to inform clinical model selection and deployment.
📝 Abstract
Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent, that is, whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, all invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score also provides an early warning signal of impending model instability: ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse. Finally, it yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
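The abstract does not give the C-Score formula, so the following is only a minimal sketch of one plausible reading of "confidence-weighted, intensity-emphasised pairwise soft IoU". The function names, the exponent `gamma` used for intensity emphasis, the min/max form of soft IoU, and the product-of-confidences pair weight are all assumptions made for illustration, not the authors' definition. Inputs are assumed to be CAM heatmaps normalised to [0, 1] for correctly classified samples of a single class.

```python
import numpy as np


def soft_iou(a, b, gamma=2.0, eps=1e-8):
    """Soft IoU of two heatmaps in [0, 1].

    Assumption: raising the maps to the power gamma (> 1) suppresses
    low activations, which is one way to emphasise high-intensity regions.
    """
    a = np.clip(np.asarray(a, dtype=float), 0.0, 1.0) ** gamma
    b = np.clip(np.asarray(b, dtype=float), 0.0, 1.0) ** gamma
    intersection = np.minimum(a, b).sum()
    union = np.maximum(a, b).sum()
    return float(intersection / (union + eps))


def consistency_score(maps, confidences, gamma=2.0):
    """Confidence-weighted mean of pairwise soft IoU within one class.

    Assumption: each pair (i, j) is weighted by the product of the two
    prediction confidences. Returns NaN when fewer than two maps exist.
    """
    conf = np.asarray(confidences, dtype=float)
    n = len(maps)
    if n < 2:
        return float("nan")
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = conf[i] * conf[j]
            weighted_sum += w * soft_iou(maps[i], maps[j], gamma)
            weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else float("nan")
```

Under these assumptions, identical heatmaps score close to 1 and spatially disjoint heatmaps score 0, so a falling value across checkpoints would indicate that the model is attending to different regions for samples of the same class.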