Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey

📅 2023-06-06
🏛️ Information Fusion
📈 Citations: 74
Influential: 7
🤖 AI Summary
Existing eXplainable Artificial Intelligence (XAI) methods are vulnerable to adversarial attacks in high-stakes applications, compromising explanation fidelity, system trustworthiness, and operational safety. Method: We conduct a systematic literature review of over 120 papers and propose the first unified taxonomic framework for XAI adversarial attacks and defenses, elucidating the coupled fragility between explanations and their underlying models. We introduce a multidimensional robustness evaluation metric suite, categorize seven canonical attack types (including gradient-based, mask-based, and surrogate-model attacks), and survey five defense paradigms, including explanation regularization, adversarial training, causal intervention, and trustworthy explanation generation. Contribution/Results: Our work establishes a foundational theoretical framework and actionable design principles for developing secure, robust XAI systems, bridging critical gaps between interpretability, robustness, and safety in AI deployment.
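
To make the gradient-based attack family concrete, here is a minimal, hypothetical sketch (not the survey's own code): it optimizes a small perturbation so the input-gradient saliency map drifts away from the original explanation while the classifier's logits stay pinned. The toy model, step count, and loss weights are illustrative assumptions; Softplus replaces ReLU because ReLU's second derivative is zero, which would block the second-order optimization this attack needs.

```python
# Hypothetical sketch of a gradient-based explanation attack (illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy classifier; Softplus (not ReLU) keeps second-order gradients nonzero.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.Softplus(), torch.nn.Linear(32, 2)
)
x = torch.randn(1, 20)

def saliency(inp):
    """Input-gradient explanation for the predicted class."""
    inp = inp.clone().detach().requires_grad_(True)
    logits = model(inp)
    logits[0, logits.argmax().item()].backward()
    return inp.grad.detach().abs()

orig_sal = saliency(x)
orig_logits = model(x).detach()
cls = orig_logits.argmax().item()

delta = torch.zeros_like(x, requires_grad=True)  # the adversarial perturbation
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    x_adv = x + delta
    logits = model(x_adv)
    # create_graph=True lets the loss differentiate through the saliency itself.
    grad = torch.autograd.grad(logits[0, cls], x_adv, create_graph=True)[0]
    # Push the saliency away from the original; pin the logits to the original.
    loss = -F.mse_loss(grad.abs(), orig_sal) + 10.0 * F.mse_loss(logits, orig_logits)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.1, 0.1)  # keep the perturbation small

print("saliency shift:", (saliency(x + delta.detach()) - orig_sal).norm().item())
```

The same template covers many attacks in this family: swap the saliency term for a targeted one (pull attribution toward chosen features) and the pinning term for a class-preservation constraint.
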
Problem

Research questions and friction points this paper is trying to address.

Explores vulnerabilities of XAI methods to adversarial attacks
Investigates security risks in model explanations and fairness metrics
Seeks defenses against attacks to ensure robust interpretation methods (a toy stability metric is sketched after this list)
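
As one concrete way such robustness questions get operationalized, the sketch below scores explanation stability as the average top-k feature overlap between an input's saliency map and the saliency maps of lightly perturbed copies. This is an illustrative metric in the spirit of the survey's evaluation suite, not a metric the paper itself defines; `explain_fn`, `eps`, and `k` are assumed placeholders.

```python
# Illustrative explanation-stability metric (not from the surveyed paper).
import torch

def topk_overlap(sal_a, sal_b, k=5):
    """Fraction of shared indices among the k highest-attribution features."""
    a = set(sal_a.flatten().topk(k).indices.tolist())
    b = set(sal_b.flatten().topk(k).indices.tolist())
    return len(a & b) / k

def explanation_stability(explain_fn, x, n_trials=20, eps=0.05, k=5):
    """Mean top-k overlap between explanations of x and of perturbed copies.

    explain_fn: any attribution method mapping an input tensor to a saliency
    tensor of the same shape (e.g., input gradients, SmoothGrad).
    """
    base = explain_fn(x)
    scores = [
        topk_overlap(base, explain_fn(x + eps * torch.randn_like(x)), k)
        for _ in range(n_trials)
    ]
    return sum(scores) / n_trials
```

A score near 1.0 means the top-k features are stable under noise; fragile explainers, or explainers under attack, drift toward 0 even when predictions are unchanged.
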
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveys adversarial attacks on XAI explanations
Proposes a unified taxonomy spanning AdvML and XAI research
Catalogs defense strategies for robust XAI methods (see the sketch after this list)
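
To ground the defense side, here is a minimal sketch of explanation regularization, one of the paradigms named in the summary: the training loss adds a penalty on how much the input-gradient explanation moves under a small random perturbation. The architecture, batch, and penalty weight are assumptions for illustration, not the survey's prescription.

```python
# Illustrative explanation-regularization training loop (assumptions throughout).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.Softplus(), torch.nn.Linear(32, 2)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def input_grad(x, y):
    """Gradient of the task loss w.r.t. the input (a simple explanation)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    # create_graph=True so the stability penalty can train the model weights.
    return torch.autograd.grad(loss, x, create_graph=True)[0]

x = torch.randn(64, 20)              # placeholder batch
y = torch.randint(0, 2, (64,))
lam = 1.0                            # penalty weight (tunable assumption)
for _ in range(10):
    task_loss = F.cross_entropy(model(x), y)
    g_clean = input_grad(x, y)
    g_noisy = input_grad(x + 0.05 * torch.randn_like(x), y)
    stability = F.mse_loss(g_noisy, g_clean)  # explanation-stability penalty
    loss = task_loss + lam * stability
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Swapping the random perturbation for an adversarially chosen one turns the same loop into adversarial training for explanations, another defense paradigm the summary names.
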