🤖 AI Summary
Existing eXplainable Artificial Intelligence (XAI) methods are vulnerable to adversarial attacks in high-stakes applications, compromising explanation fidelity, system trustworthiness, and operational safety.
Method: We conduct a systematic literature review of more than 120 papers and propose the first unified taxonomy of adversarial attacks and defenses for XAI, elucidating the coupled fragility of explanations and their underlying models. We introduce a multidimensional suite of robustness evaluation metrics, comprehensively categorize seven canonical attack types (including gradient-based, mask-based, and surrogate-model attacks), and organize defenses into paradigms such as explanation regularization, adversarial training, causal intervention, and trustworthy explanation generation.
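To make the gradient-based attack category concrete, the sketch below illustrates the core idea on a hypothetical toy model (all names and the two-layer softplus network are illustrative assumptions, not the survey's method): search for a small input perturbation that shifts the gradient-based saliency explanation while leaving the model's prediction unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy 2-layer net; softplus keeps the saliency map smooth in x.
W1 = rng.normal(size=(8, 5))
w2 = rng.normal(size=8)

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return w2 @ softplus(W1 @ x)  # scalar logit

def saliency(x):
    # Vanilla-gradient explanation: gradient of the logit w.r.t. the input.
    return W1.T @ (sigmoid(W1 @ x) * w2)

def explanation_attack(x, eps=0.3, steps=200, lr=0.05):
    """Finite-difference ascent on the explanation shift
    ||saliency(x') - saliency(x)||_1, constrained to keep the predicted
    label and stay inside an L-infinity ball of radius eps around x."""
    s0 = saliency(x)
    label0 = predict(x) > 0
    x_adv = x.copy()
    for _ in range(steps):
        g = np.zeros_like(x_adv)
        for i in range(len(x_adv)):  # numerical gradient of the objective
            d = np.zeros_like(x_adv)
            d[i] = 1e-4
            g[i] = (np.abs(saliency(x_adv + d) - s0).sum()
                    - np.abs(saliency(x_adv - d) - s0).sum()) / 2e-4
        cand = np.clip(x_adv + lr * np.sign(g), x - eps, x + eps)
        if (predict(cand) > 0) == label0:  # reject label-flipping steps
            x_adv = cand
    return x_adv

x = rng.normal(size=5)
x_adv = explanation_attack(x)
shift = np.abs(saliency(x_adv) - saliency(x)).sum()
```

The constraint that the prediction stays fixed while the explanation moves is precisely the "coupled fragility" the survey highlights: the model appears to behave identically, yet the attribution a user sees can be steered within a small perturbation budget.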
Contribution/Results: Our work establishes a foundational theoretical framework and actionable design principles for developing secure, robust XAI systems, bridging critical gaps between interpretability, robustness, and safety in AI deployment.