Training Feature Attribution for Vision Models

📅 2025-10-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Interpreting deep neural networks requires considering input features and training samples together. This paper introduces “training feature attribution,” a paradigm that attributes a model's prediction on a test sample to localized regions (e.g., image patches) within specific training images, thereby connecting input space and training-data space. Methodologically, it combines influence function estimation with pixel-level sensitivity analysis to attribute jointly over training instances and input features. On vision tasks, the approach identifies harmful training regions that induce misclassification and exposes latent spurious shortcut features, which conventional attribution methods focused exclusively on inputs or on training samples fail to reveal. The framework offers a tool for model diagnosis and trustworthy AI development.
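
To make the idea of attributing a test prediction to regions of a training image concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes a toy logistic-regression "model" on 4×4 images, uses a gradient dot product as a simple stand-in for an influence estimate, and uses patch occlusion as a stand-in for pixel-level sensitivity. All names (`loss_grad`, `influence`, `patch_attribution`) and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. the weights for one image."""
    flat = x.ravel()
    return (sigmoid(w @ flat) - y) * flat

def influence(w, x_train, y_train, x_test, y_test):
    """Gradient-similarity proxy (an assumption, not the paper's estimator).

    A positive value means a gradient step on the training example would,
    to first order, lower the loss on the test example.
    """
    return float(loss_grad(w, x_train, y_train) @ loss_grad(w, x_test, y_test))

def patch_attribution(w, x_train, y_train, x_test, y_test, patch=2):
    """Score each training-image patch by the influence lost when it is zeroed."""
    base = influence(w, x_train, y_train, x_test, y_test)
    height, width = x_train.shape
    scores = np.zeros((height // patch, width // patch))
    for i in range(0, height - patch + 1, patch):
        for j in range(0, width - patch + 1, patch):
            occluded = x_train.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            scores[i // patch, j // patch] = base - influence(
                w, occluded, y_train, x_test, y_test
            )
    return scores

# Toy setup (hypothetical): the model relies only on the top-left 2x2 patch,
# mimicking a patch-based shortcut.
w = np.zeros((4, 4))
w[:2, :2] = 1.0
w = w.ravel()

x_train = np.full((4, 4), 0.1)
x_train[:2, :2] = 1.0   # training image carries the shortcut patch
x_test = x_train.copy() # test image carries the same patch
y_train, y_test = 0.0, 0.0

scores = patch_attribution(w, x_train, y_train, x_test, y_test)
top_patch = np.unravel_index(scores.argmax(), scores.shape)
print(scores)
print("most influential training patch (row, col):", top_patch)
```

In this toy setting the top-left patch of the training image receives by far the largest score, because occluding it removes most of the gradient overlap with the test example. A real vision model would need automatic differentiation and a more careful influence estimator; occlusion is used here only because it keeps the sketch dependency-free beyond NumPy.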

📝 Abstract

Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.

Problem

Research questions and friction points this paper is trying to address.

Linking test predictions to specific training image regions

Identifying harmful training examples causing model misclassifications

Revealing spurious correlations missed by conventional attribution methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Links test predictions to specific training image regions

Identifies harmful examples causing model misclassifications

Reveals spurious correlations missed by conventional methods
