Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes an interpretability-guided adversarial training framework that addresses the tension between robustness and interpretability in deep learning models deployed in safety-critical settings. For the first time, LIME-based feature attribution is integrated directly into the training process: feature masking and sensitivity-aware regularization actively suppress spurious or unstable features identified by the explainer, without requiring additional data or architectural modifications. A theoretical analysis establishes a lower bound linking the alignment of feature attributions to model robustness. Extensive experiments demonstrate that the proposed method significantly improves both adversarial robustness and out-of-distribution generalization on the CIFAR-10, CIFAR-10-C, and CIFAR-100 benchmarks.
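The summary mentions feature masking driven by LIME attributions but gives no implementation. Below is a minimal illustrative sketch, not the authors' code, of how LIME superpixel attributions could be turned into an input mask that suppresses low-attribution regions; the helper name `make_masker`, the `keep_top_k` cutoff, and the sample count are all assumptions.

```python
# Illustrative sketch, not the paper's released code. Assumes a
# model_predict_fn that maps a batch of HxWx3 float images to class
# probabilities, as required by LIME's image explainer.
from lime import lime_image

def make_masker(model_predict_fn, num_samples=500, keep_top_k=5):
    explainer = lime_image.LimeImageExplainer()

    def mask_spurious_regions(image_hwc):
        # Explain the model's top prediction for this image.
        explanation = explainer.explain_instance(
            image_hwc,
            model_predict_fn,
            top_labels=1,
            hide_color=0,
            num_samples=num_samples,
        )
        label = explanation.top_labels[0]
        # Keep only the k most positively attributed superpixels; all other
        # regions are treated as potentially spurious and zeroed out.
        _, mask = explanation.get_image_and_mask(
            label, positive_only=True, num_features=keep_top_k, hide_rest=False
        )
        return image_hwc * mask[..., None]

    return mask_spurious_regions
```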

๐Ÿ“ Abstract
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach requires no additional datasets or changes to the model architecture and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.
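The abstract outlines a closed-loop pipeline of feature masking, sensitivity-aware regularization, and adversarial augmentation, again without code. The PyTorch sketch below is one speculative way to combine those three ingredients in a single training step: it penalizes input gradients on LIME-flagged spurious regions and reuses the same gradient for FGSM-style adversarial augmentation. The function name `refinement_step`, the FGSM choice, and the hyperparameters `eps` and `lam` are assumptions, not details from the paper.

```python
# Hedged sketch of the closed-loop idea from the abstract, not the authors'
# method: (i) a gradient penalty on LIME-flagged spurious inputs as a
# stand-in for "sensitivity-aware regularization" and (ii) FGSM adversarial
# augmentation. eps and lam are assumed hyperparameters.
import torch
import torch.nn.functional as F

def refinement_step(model, x, y, spurious_mask, optimizer, eps=8 / 255, lam=0.1):
    # spurious_mask: 1 on regions LIME flagged as spurious, 0 elsewhere.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)

    # Sensitivity penalty: discourage the loss from depending on spurious pixels.
    (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
    sens_penalty = (grad_x * spurious_mask).pow(2).sum(dim=(1, 2, 3)).mean()

    # FGSM adversarial augmentation reusing the same input gradient.
    x_adv = (x + eps * grad_x.sign()).clamp(0, 1).detach()
    adv_loss = F.cross_entropy(model(x_adv), y)

    total = loss + adv_loss + lam * sens_penalty
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```

Note that the sensitivity penalty differentiates through the input gradient, so the first gradient must be taken with `create_graph=True` to enable the second-order backward pass.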
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
model robustness
explainability
interpretability
deep learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainability-Guided Defense
Attribution-Aware Refinement
Adversarial Robustness
LIME
Feature Suppression
👥 Authors
Longwei Wang, Mohammad Navid Nayyem, Abdullah Al Rakin, K. Santosh
AI Lab, Department of Computer Science, University of South Dakota, Vermillion, SD, USA