How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

📅 2025-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reveals that post-hoc attribution methods for deep learning models—such as Grad-CAM and Integrated Gradients—are highly sensitive to fine-grained training details of the classifier head, challenging the prevailing assumption that attributions are invariant to classifier training. Through systematic probing experiments across supervised, self-supervised, and contrastive vision-language pretraining paradigms, the authors demonstrate that fine-tuning strategies for the classifier layer (accounting for <10% of total parameters) exert a stronger influence on attribution quality than the pretraining paradigm itself. To address this, they propose a lightweight, plug-and-play classifier-head adaptation scheme that requires no backbone retraining. Evaluated across multiple fidelity and consistency metrics, the approach delivers consistent improvements across diverse pretraining frameworks and attribution methods. The results establish a more robust and practical pathway for interpretable AI, emphasizing the critical role of classifier-layer design in attribution reliability.


📝 Abstract
Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover and demonstrate a strong dependency: the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are eventually interpreted. With this finding we also present simple yet effective adjustments to the classification layers that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.
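The core observation—that the same frozen backbone features yield different attributions under differently trained classifier heads—can be illustrated with a minimal sketch. This is not the authors' code: it is a toy NumPy implementation of Integrated Gradients applied to two hypothetical linear heads (`w_a`, `w_b` are made-up weights) over the same feature vector, showing that the attribution map is a function of the head, not just the backbone.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate Integrated Gradients along the straight-line path
    from `baseline` to `x` with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, d) interpolants
    grads = np.stack([f_grad(p) for p in path])         # gradient at each point
    return (x - baseline) * grads.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)     # stands in for frozen backbone features
baseline = np.zeros(4)

# Two hypothetical classifier heads trained differently on the SAME features.
w_a = np.array([1.0, 0.0, 0.5, -0.5])
w_b = np.array([0.2, 0.9, -0.3, 0.1])

# For a linear head f(z) = w @ z the gradient is constant (= w), so IG
# reduces exactly to (x - baseline) * w — and thus changes with the head.
ig_a = integrated_gradients(lambda p: w_a, x, baseline)
ig_b = integrated_gradients(lambda p: w_b, x, baseline)
```

Here `ig_a` and `ig_b` disagree feature-by-feature even though the "backbone" features `x` are identical, which is the independence assumption the paper challenges in miniature.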
Problem

Research questions and friction points this paper is trying to address.

Challenges independence assumption of post-hoc explanation methods
Highlights training details' impact on model explanations
Proposes adjustments to improve explanation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training details crucial for model explanations
Adjustments to classification layers enhance explanations
Empirical evidence challenges independent explanation assumption