🤖 AI Summary
Current evaluations of feature attribution robustness conflate changes in model outputs with instability in attribution maps, obscuring flaws inherent to the attribution methods themselves. Method: The authors redefine attribution robustness as the stability of attribution maps under semantically similar inputs (specifically, adversarial perturbations generated by GANs) while permitting reasonable variation in model predictions. They propose a novel robustness metric based on structural similarity between attribution maps. Results: Under this refined framework, mainstream attribution methods, including Grad-CAM and Integrated Gradients, show markedly lower robustness scores, exposing their intrinsic instability. The work establishes a more objective and interpretable evaluation paradigm for attribution robustness, providing a principled benchmark for assessing the reliability of explainable-AI systems.
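The structural-similarity idea lends itself to a compact sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes attributions are 2-D saliency maps of equal shape, uses scikit-image's `structural_similarity` as the SSIM backend, and the helper names `normalize` and `attribution_robustness` are hypothetical.

```python
# Hypothetical sketch of an SSIM-based attribution-robustness score.
# Assumptions: attributions are 2-D saliency maps of equal shape
# (at least 7x7, SSIM's default window size).
import numpy as np
from skimage.metrics import structural_similarity


def normalize(attr: np.ndarray) -> np.ndarray:
    """Rescale an attribution map to [0, 1] so maps are comparable."""
    attr = attr - attr.min()
    rng = attr.max()
    return attr / rng if rng > 0 else attr


def attribution_robustness(attr_original: np.ndarray,
                           attr_perturbed: np.ndarray) -> float:
    """Structural similarity between two attribution maps.

    Scores near 1 mean the explanation is stable under the
    semantically similar input; low scores signal instability
    of the attribution method itself.
    """
    a = normalize(attr_original)
    b = normalize(attr_perturbed)
    return structural_similarity(a, b, data_range=1.0)
```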
📝 Abstract
This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness, which largely ignores differences in the model's outputs, and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks for generating such inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than those of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.
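To make the evaluation protocol concrete, here is one plausible shape of the loop. This is a sketch under stated assumptions, not the paper's code: `attribute(model, x)` is assumed to return a saliency map, `generate_similar(x)` is a hypothetical stand-in for the GAN-based generator of semantically similar inputs, and `attribution_robustness` is the SSIM-based score sketched above.

```python
# Hypothetical evaluation loop. `attribute` and `generate_similar`
# stand in for the paper's components and are not actual APIs from
# the work. Reuses attribution_robustness from the sketch above.
import numpy as np


def evaluate_attribution_robustness(model, attribute, generate_similar,
                                    inputs, n_perturbations=10):
    """Average attribution robustness over GAN-generated similar inputs.

    Pairs whose predictions diverge are deliberately not discarded:
    reasonable variation in model outputs is allowed, and only the
    stability of the attribution maps is judged.
    """
    scores = []
    for x in inputs:
        attr_x = attribute(model, x)
        for _ in range(n_perturbations):
            x_sim = generate_similar(x)          # semantically similar input
            attr_sim = attribute(model, x_sim)   # its attribution map
            scores.append(attribution_robustness(attr_x, attr_sim))
    return float(np.mean(scores))
```

By averaging only over attribution-map similarity, this design attributes low scores to the explanation method rather than to output changes in the network, which is the distinction the paper draws against prior metrics.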