🤖 AI Summary
Interpretability of deep neural networks requires joint consideration of input features and training samples. This paper introduces “training feature attribution,” a novel paradigm that, for the first time, attributes model predictions for test samples to localized regions (e.g., image patches) within specific training images, thereby jointly modeling interactions between the input space and the training data space. Methodologically, it integrates influence function estimation with pixel-level sensitivity analysis to enable end-to-end, joint attribution over both training instances and input features. Evaluated on vision tasks, the approach identifies adversarial training regions that induce misclassification and latent spurious shortcut features, outperforming conventional attribution methods that focus exclusively on either inputs or training samples. The framework provides a principled tool for model diagnosis and trustworthy AI development.
📝 Abstract
Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
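To make the idea of linking a test prediction to specific features of a specific training example more concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a logistic-regression model in place of a deep network, uses a gradient dot product as a simple stand-in for a full influence function (no inverse Hessian), and estimates the sensitivity of that score to each training feature with central finite differences. All function names (`loss_grad`, `influence`, `training_feature_attribution`) are hypothetical and chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. the weights for one sample."""
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(s - y) * xi for xi in x]


def influence(w, x_train, y_train, x_test, y_test):
    """Influence proxy (assumption): dot product of the training and
    test loss gradients, not a full inverse-Hessian influence function."""
    g_train = loss_grad(w, x_train, y_train)
    g_test = loss_grad(w, x_test, y_test)
    return sum(a * b for a, b in zip(g_train, g_test))


def training_feature_attribution(w, x_train, y_train, x_test, y_test, eps=1e-5):
    """Sensitivity of the influence proxy to each feature of the training
    sample, estimated with central finite differences."""
    scores = []
    for j in range(len(x_train)):
        hi = list(x_train)
        lo = list(x_train)
        hi[j] += eps
        lo[j] -= eps
        delta = (influence(w, hi, y_train, x_test, y_test)
                 - influence(w, lo, y_train, x_test, y_test))
        scores.append(delta / (2 * eps))
    return scores


# Toy usage: three "pixels" per sample; the values are made up for illustration.
weights = [0.5, -0.25, 0.0]
train_x, train_y = [1.0, 2.0, 3.0], 1
test_x, test_y = [1.0, 0.0, 0.0], 0
attribution = training_feature_attribution(weights, train_x, train_y, test_x, test_y)
```

Each entry of `attribution` indicates how strongly a small change to one feature of the training sample would change that sample's estimated influence on the test prediction. For image models, the analogous quantity would be a map over training-image pixels, typically computed with automatic differentiation rather than finite differences.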