Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes an interpretability-guided adversarial training framework that addresses the tension between robustness and interpretability in deep learning models deployed in safety-critical settings. For the first time, LIME-based feature attribution is integrated directly into the training process: feature masking and sensitivity-aware regularization actively suppress spurious or unstable features identified by the explainer, without requiring additional data or architectural modifications. A theoretical analysis establishes a lower bound linking the alignment of feature attributions to model robustness. Extensive experiments demonstrate that the proposed method significantly improves both adversarial robustness and out-of-distribution generalization on the CIFAR-10, CIFAR-10-C, and CIFAR-100 benchmarks.
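The summary mentions feature masking driven by LIME attributions but gives no implementation. Below is a minimal illustrative sketch, not the authors' code, of how LIME superpixel attributions could be turned into an input mask that suppresses low-attribution regions; the helper name `make_masker`, the `keep_top_k` cutoff, and the sample count are all assumptions.

```python
# Illustrative sketch, not the paper's released code. Assumes a
# model_predict_fn that maps a batch of HxWx3 float images to class
# probabilities, as required by LIME's image explainer.
from lime import lime_image

def make_masker(model_predict_fn, num_samples=500, keep_top_k=5):
    explainer = lime_image.LimeImageExplainer()

    def mask_spurious_regions(image_hwc):
        # Explain the model's top prediction for this image.
        explanation = explainer.explain_instance(
            image_hwc,
            model_predict_fn,
            top_labels=1,
            hide_color=0,
            num_samples=num_samples,
        )
        label = explanation.top_labels[0]
        # Keep only the k most positively attributed superpixels; all other
        # regions are treated as potentially spurious and zeroed out.
        _, mask = explanation.get_image_and_mask(
            label, positive_only=True, num_features=keep_top_k, hide_rest=False
        )
        return image_hwc * mask[..., None]

    return mask_spurious_regions
```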

๐Ÿ“ Abstract
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach requires no additional datasets or changes to the model architecture and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.
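The abstract outlines a closed-loop pipeline of feature masking, sensitivity-aware regularization, and adversarial augmentation, again without code. The PyTorch sketch below is one speculative way to combine those three ingredients in a single training step: it penalizes input gradients on LIME-flagged spurious regions and reuses the same gradient for FGSM-style adversarial augmentation. The function name `refinement_step`, the FGSM choice, and the hyperparameters `eps` and `lam` are assumptions, not details from the paper.

```python
# Hedged sketch of the closed-loop idea from the abstract, not the authors'
# method: (i) a gradient penalty on LIME-flagged spurious inputs as a
# stand-in for "sensitivity-aware regularization" and (ii) FGSM adversarial
# augmentation. eps and lam are assumed hyperparameters.
import torch
import torch.nn.functional as F

def refinement_step(model, x, y, spurious_mask, optimizer, eps=8 / 255, lam=0.1):
    # spurious_mask: 1 on regions LIME flagged as spurious, 0 elsewhere.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)

    # Sensitivity penalty: discourage the loss from depending on spurious pixels.
    (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
    sens_penalty = (grad_x * spurious_mask).pow(2).sum(dim=(1, 2, 3)).mean()

    # FGSM adversarial augmentation reusing the same input gradient.
    x_adv = (x + eps * grad_x.sign()).clamp(0, 1).detach()
    adv_loss = F.cross_entropy(model(x_adv), y)

    total = loss + adv_loss + lam * sens_penalty
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```

Note that the sensitivity penalty differentiates through the input gradient, so the first gradient must be taken with `create_graph=True` to enable the second-order backward pass.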
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
model robustness
explainability
interpretability
deep learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainability-Guided Defense
Attribution-Aware Refinement
Adversarial Robustness
LIME
Feature Suppression
👥 Authors
Longwei Wang, Mohammad Navid Nayyem, Abdullah Al Rakin, K. Santosh
AI Lab, Department of Computer Science, University of South Dakota, Vermillion, SD, USA