Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
The eXplainable AI (XAI) field has long lacked systematic evaluation frameworks, hindering principled method selection. To address this, we introduce LATEC, a large-scale XAI benchmark that systematically varies model architectures and input modalities while critically evaluating 17 explanation methods with 20 metrics, yielding 7,560 examined combinations. Our analysis reveals substantial disagreement among metrics, undermining the reliability of conventional single-metric rankings; notably, Expected Gradients, a method overlooked in prior comparative studies, emerges as the top performer across diverse modalities and architectures. We publicly release all 326k saliency maps and 378k metric scores as a (meta-)evaluation dataset, accompanied by guidance for task-aware XAI method selection. The core contributions are (1) a multi-dimensional, controlled evaluation paradigm and (2) a principled scheme for analyzing metric consistency and interdependence.
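For context on the top-performing method: Expected Gradients (Erion et al., 2021) is Integrated Gradients with baselines sampled from a reference distribution instead of a single fixed baseline. A minimal PyTorch sketch of the estimator (the function name and interface are illustrative, not the LATEC implementation):

```python
import torch

def expected_gradients(model, x, references, target, n_samples=50):
    """Monte-Carlo Expected Gradients: Integrated Gradients averaged over
    baselines drawn from a reference set, with the path position alpha
    sampled uniformly in [0, 1] for each draw."""
    attribution = torch.zeros_like(x)
    for _ in range(n_samples):
        # Draw one reference x' and one path position alpha ~ U(0, 1).
        ref = references[torch.randint(len(references), (1,))]
        alpha = torch.rand(1, device=x.device)
        point = (ref + alpha * (x - ref)).detach().requires_grad_(True)
        # Gradient of the target-class logit at the interpolated point.
        logit = model(point)[:, target].sum()
        grad = torch.autograd.grad(logit, point)[0]
        # Accumulate (x - x') * dF/dx and average over samples.
        attribution += (x - ref) * grad
    return attribution / n_samples
```

In practice, Captum's `GradientShap` implements a closely related estimator (the same sampled-baseline path gradients, with added input noise), so a library call is usually preferable to hand-rolling the loop above.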

📝 Abstract
Explainable AI (XAI) is a rapidly growing domain with a myriad of proposed methods as well as metrics aiming to evaluate their efficacy. However, current studies are often of limited scope, examining only a handful of XAI methods and ignoring underlying design parameters for performance, such as the model architecture or the nature of input data. Moreover, they often rely on one or a few metrics and neglect thorough validation, increasing the risk of selection bias and ignoring discrepancies among metrics. These shortcomings leave practitioners confused about which method to choose for their problem. In response, we introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics. We systematically incorporate vital design parameters like varied architectures and diverse input modalities, resulting in 7,560 examined combinations. Through LATEC, we showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme. Further, we comprehensively evaluate various XAI methods to assist practitioners in selecting appropriate methods aligning with their needs. Curiously, the emerging top-performing method, Expected Gradients, is not examined in any relevant related study. LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-)evaluation dataset. The benchmark is hosted at: https://github.com/IML-DKFZ/latec.
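The "conflicting metrics" finding can be probed directly from the released scores by comparing the method rankings each metric induces, for example via pairwise Spearman correlation. A sketch with synthetic data (the array shape mirrors the paper's 17 methods and 20 metrics, but the scores and aggregation here are purely illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in: one aggregated score per (method, metric) pair.
rng = np.random.default_rng(0)
scores = rng.random((17, 20))  # 17 XAI methods x 20 evaluation metrics

# Spearman correlation between the method rankings induced by each pair
# of metrics; spearmanr treats columns as variables, so `rho` is a
# 20 x 20 inter-metric agreement matrix.
rho, _ = spearmanr(scores)
upper = rho[np.triu_indices(20, k=1)]
print(f"mean inter-metric rank agreement: {upper.mean():.2f}")
print(f"most conflicting pair correlation: {upper.min():.2f}")
```

Low or negative off-diagonal entries flag metric pairs that rank the same methods in opposing orders, which is exactly the failure mode that makes single-metric rankings unreliable.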
Problem

Research questions and friction points this paper is trying to address.

Interpretable AI
Evaluation Framework
Uncertainty in Selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

LATEC
XAI Evaluation
Expected Gradients
Lukas Klein
EPFL, USZ
Machine Learning · Biotech · Computer Vision
Carsten T. Lüth
PhD Student @ Interactive Machine Learning Research Group
Label Efficient Training of Deep Learning Models
Udo Schlegel
PostDoc, LMU München
Explainable AI · Deep Learning · Visual Analytics · Time Series Analysis
Till J. Bungert
German Cancer Research Center (DKFZ), Interactive Machine Learning Group, Germany; Helmholtz Imaging, German Cancer Research Center (DKFZ), Germany; Heidelberg University, Department of Computer Science, Germany
Mennatallah El-Assady
ETH Zürich
Visualization · Intelligence Augmentation · XAI · Interactive Machine Learning · Natural Language
Paul F. Jäger
German Cancer Research Center (DKFZ), Interactive Machine Learning Group, Germany; Helmholtz Imaging, German Cancer Research Center (DKFZ), Germany