🤖 AI Summary
This work addresses the instability of deep neural networks, in both their predictions and their explanations, under adversarial perturbations or out-of-distribution data, a setting in which existing methods struggle to balance robustness, accuracy, and interpretability. The paper proposes the Explanation-Guided Adversarial Training (EGAT) framework, which unifies explanation-guided learning with adversarial training for the first time by incorporating attribution constraints during adversarial example generation. EGAT jointly optimizes classification performance, adversarial robustness, and explanation stability. Theoretical analysis grounded in the PAC learning framework demonstrates that the model yields more reliable predictions under unexpected conditions. Empirical results show that EGAT improves both clean and adversarial accuracy by over 37% on out-of-distribution benchmarks compared to baselines, significantly enhances the semantic plausibility of explanations, and incurs only a 16% increase in training overhead.
📝 Abstract
Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. Meanwhile, both the predictions and the saliency maps of DNNs can change dramatically when facing imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct (PAC) learning framework, demonstrating theoretically that it yields more stable predictions in unexpected situations than standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean and adversarial accuracy (+37%) while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.
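The joint objective described above can be sketched on a toy model. The following is a minimal illustration only, assuming an FGSM-style one-step attack, input-gradient saliency as the attribution, and illustrative penalty weights (`lam_adv`, `lam_expl`); the paper's actual network, attack, and loss formulation are not reproduced here, and all function names are hypothetical.

```python
import numpy as np

# Toy "network": logistic regression, so every gradient is closed-form.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    return sigmoid(np.dot(w, x))

def ce_loss(w, x, y):
    # Binary cross-entropy for a single example.
    p = predict(w, x)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def input_grad(w, x, y):
    # dL/dx = (p - y) * w for logistic regression with cross-entropy.
    return (predict(w, x) - y) * w

def attribution(w, x, y):
    # Simple saliency map: absolute input gradient (one choice of EGL attribution).
    return np.abs(input_grad(w, x, y))

def fgsm(w, x, y, eps=0.1):
    # One-step adversarial example generated on the fly (FGSM-style, an assumption).
    return x + eps * np.sign(input_grad(w, x, y))

def egat_loss(w, x, y, lam_adv=1.0, lam_expl=0.5, eps=0.1):
    # Jointly penalize: clean error, adversarial error, and attribution drift
    # between the clean and adversarial inputs (explanation stability).
    x_adv = fgsm(w, x, y, eps)
    clean = ce_loss(w, x, y)
    adv = ce_loss(w, x_adv, y)
    expl = np.sum((attribution(w, x, y) - attribution(w, x_adv, y)) ** 2)
    return clean + lam_adv * adv + lam_expl * expl

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.4])
loss = egat_loss(w, x, y=1.0)
```

Because the adversarial term upper-bounds the clean term for a first-order attack, the combined objective is always at least as large as standard training loss; the explanation-stability term is what distinguishes this sketch from plain adversarial training.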