🤖 AI Summary
Existing pixel-wise attribution methods are highly sensitive to minor input perturbations: attribution maps can change drastically even when the model's prediction does not, which severely undermines their trustworthiness. To address this, we propose the first certification framework that provides provable pixel-level robustness guarantees for arbitrary black-box attribution methods. Our approach leverages randomized smoothing to reformulate attribution robustness verification as a binary segmentation problem and introduces three principled evaluation metrics: certified robustness, localization accuracy, and fidelity. Key technical components include attribution-map sparsification and smoothing, coupled with certification against ℓ₂-bounded perturbations. We conduct comprehensive experiments across five ImageNet models and twelve attribution methods. The results demonstrate that our certified attribution maps achieve strong robustness guarantees while preserving interpretability and high fidelity, enabling reliable use in downstream trustworthy-AI applications.
📝 Abstract
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is available at https://github.com/AlaaAnani/certified-attributions.
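To make the pipeline concrete, here is a minimal sketch of the sparsify-smooth-certify idea described above. This is not the authors' implementation: the `attribution_fn` interface, the top-k sparsification rule, the Hoeffding confidence bound, and all parameter names are illustrative assumptions; the per-pixel radius uses the standard randomized-smoothing form $R = \sigma\,\Phi^{-1}(\underline{p})$ applied independently at each pixel.

```python
import numpy as np
from scipy.stats import norm


def certify_attribution(image, attribution_fn, sigma=0.25,
                        n_samples=100, top_k_frac=0.1, alpha=0.05):
    """Per-pixel certification of a sparsified attribution map via
    randomized smoothing (illustrative sketch, not the paper's exact method).

    attribution_fn: hypothetical black-box map, image -> scores of shape (H, W).
    Returns the majority-vote binary importance mask and a per-pixel
    certified l2 radius (0 where the vote is not statistically significant).
    """
    H, W = image.shape[:2]
    votes = np.zeros((H, W))
    k = max(1, int(top_k_frac * H * W))
    for _ in range(n_samples):
        # smooth: evaluate the attribution under Gaussian input noise
        noisy = image + sigma * np.random.randn(*image.shape)
        attr = attribution_fn(noisy)
        # sparsify: binarize by keeping only the top-k most important pixels
        thresh = np.partition(attr.ravel(), -k)[-k]
        votes += (attr >= thresh)
    # empirical probability that each pixel is voted "important"
    p_hat = votes / n_samples
    mask = p_hat >= 0.5
    # crude lower confidence bound on the majority-class probability (Hoeffding);
    # a Clopper-Pearson bound would be tighter in practice
    p_major = np.maximum(p_hat, 1.0 - p_hat)
    p_lower = p_major - np.sqrt(np.log(1.0 / alpha) / (2.0 * n_samples))
    # certified l2 radius where the bound clears 1/2, else abstain (radius 0)
    radius = np.where(p_lower > 0.5,
                      sigma * norm.ppf(np.clip(p_lower, 0.5, 1.0 - 1e-9)),
                      0.0)
    return mask, radius
```

Treating each pixel's importance as a binary class turns the certification into exactly the segmentation-style problem the abstract describes: a pixel's label is certified within radius `R` whenever the smoothed vote is provably bounded away from 1/2.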