Improving Adversarial Robustness of Attribution via Implicit Regularization

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Attributions are often not robust to adversarial perturbations, and existing remedies rely on explicit regularization that incurs high computational overhead. This work shows that the learning dynamics of standard stochastic gradient descent (SGD) implicitly improve attribution robustness at negligible additional computational cost. Theoretically, it links curvature in parameter space to curvature in input space as the mechanism behind this implicit regularization. It further shows that softmax-based attention limits these robustness gains because of inherent entropy constraints, and that replacing softmax attention with kernel-based attention restores them in Transformer models. Experiments across architectures, datasets, and attribution methods support these findings.
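As a rough illustration of what "attribution robustness" can mean in practice, the sketch below compares a plain input-gradient attribution at an input and at a slightly perturbed copy, using cosine similarity. Everything in it is an assumption made for illustration: the two-unit tanh network, its weights `W` and `v`, the fixed perturbation `delta`, and the choice of cosine similarity as the score. It is not the paper's model, attack, or metric.

```python
import math

# Hypothetical two-unit network f(x) = sum_j v[j] * tanh(W[j] . x).
# The weights are illustrative values, not taken from the paper.
W = [[1.0, -2.0], [0.5, 1.5]]
v = [1.0, -1.0]

def gradient_attribution(x):
    """Input gradient of f at x, used here as a plain saliency attribution."""
    grad = [0.0] * len(x)
    for w_j, v_j in zip(W, v):
        pre = sum(w * xi for w, xi in zip(w_j, x))
        slope = v_j * (1.0 - math.tanh(pre) ** 2)  # derivative of v_j * tanh(pre)
        for i, w in enumerate(w_j):
            grad[i] += slope * w
    return grad

def cosine_similarity(a, b):
    """Cosine of the angle between two attribution vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

x = [0.3, -0.1]
delta = [0.05, 0.05]  # small fixed perturbation; a real attack would optimize this
x_perturbed = [xi + di for xi, di in zip(x, delta)]

similarity = cosine_similarity(
    gradient_attribution(x), gradient_attribution(x_perturbed)
)
print(f"attribution cosine similarity: {similarity:.4f}")
```

A similarity close to 1 means the attribution barely changes under the perturbation. How quickly the input gradient changes is governed by the input-space curvature of `f`, which is the quantity the summary ties to robustness.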

📝 Abstract

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.
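To make the contrast between softmax attention and kernel-based attention concrete, here is a minimal sketch of how attention weights for a single query could be computed under each scheme. The abstract does not say which kernel the authors use; the feature map `elu(x) + 1` below is an assumption borrowed from the linear-attention literature, and the query and key values are made up for illustration.

```python
import math

def softmax_weights(scores):
    """Softmax attention weights for one query: exp(s_i) / sum_j exp(s_j)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed kernel choice)."""
    return x + 1.0 if x > 0 else math.exp(x)

def kernel_weights(query, keys):
    """Kernel attention weights: phi(q).phi(k_i) / sum_j phi(q).phi(k_j)."""
    phi_q = [elu_plus_one(val) for val in query]
    sims = []
    for key in keys:
        phi_k = [elu_plus_one(val) for val in key]
        sims.append(sum(a * b for a, b in zip(phi_q, phi_k)))
    total = sum(sims)
    return [s / total for s in sims]

def entropy(weights):
    """Shannon entropy (in nats) of a weight distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Illustrative query and keys, not taken from the paper.
query = [1.0, -0.5]
keys = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

soft = softmax_weights(scores)
kern = kernel_weights(query, keys)
print("softmax weights:", soft, "entropy:", entropy(soft))
print("kernel weights: ", kern, "entropy:", entropy(kern))
```

Both schemes yield positive weights that sum to 1, but the kernel variant replaces the exponential of the dot product with an inner product of feature maps. The sketch only shows the mechanics of the swap; it does not reproduce the entropy argument or the robustness results claimed in the abstract.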

Problem

Research questions and friction points this paper is trying to address.

adversarial robustness

attribution

implicit regularization

attention mechanism

softmax normalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit regularization

attribution robustness

learning dynamics