🤖 AI Summary
This work addresses the problem of efficiently identifying the key training samples underlying a model's predictions, with the goal of enhancing interpretability and safety. It formulates data attribution as a Bayesian information-theoretic problem, using the increase in predictive entropy (i.e., the information loss) induced by removing a sample as the attribution criterion, thereby crediting samples for reducing prediction uncertainty rather than for fitting label noise. The approach leverages Gaussian process surrogates and tangent features for efficient approximation, and introduces a scalable information-gain objective coupled with a variance-correction mechanism that makes it compatible with large-scale vector database retrieval. Empirically, the method demonstrates strong performance across counterfactual sensitivity, ground-truth attribution retrieval, and coreset selection tasks, offering both theoretical rigor and scalability to modern deep architectures.
📝 Abstract
Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce, i.e., the increase in predictive entropy at a query when they are removed. This criterion credits examples for resolving predictive uncertainty rather than for fitting label noise. To scale to modern networks, we approximate information loss using a Gaussian process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For larger-scale retrieval, we relax the objective to an information gain and add a variance correction, enabling scalable attribution in vector databases. Experiments demonstrate competitive performance on counterfactual sensitivity, ground-truth retrieval, and coreset selection, showing that our method scales to modern architectures while bridging principled information-theoretic measures with practice.
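To make the attribution criterion concrete, the sketch below illustrates one plausible reading of it under a GP surrogate with a linear (tangent-feature) kernel: the score of a training subset is the increase in Gaussian predictive entropy at a query when that subset is removed. This is a minimal illustration, not the paper's implementation; the feature matrix `Phi_train`, query feature `phi_q`, and the fixed `noise` level are assumptions introduced here for the example.

```python
import numpy as np

def predictive_entropy(Phi_train, phi_q, noise=1e-2):
    """Entropy of the GP predictive distribution at a query point.

    Uses a linear kernel on (tangent) features: K = Phi Phi^T + noise * I.
    Predictive variance is k(q,q) - k(q,X) K^{-1} k(X,q) + noise, and the
    entropy of a Gaussian is 0.5 * log(2*pi*e * variance).
    """
    n = len(Phi_train)
    K = Phi_train @ Phi_train.T + noise * np.eye(n)
    k_q = Phi_train @ phi_q                      # cross-covariances k(X, q)
    var = phi_q @ phi_q - k_q @ np.linalg.solve(K, k_q) + noise
    return 0.5 * np.log(2 * np.pi * np.e * var)

def information_loss(Phi_train, phi_q, idx, noise=1e-2):
    """Entropy increase at the query when training example(s) idx are removed."""
    keep = np.setdiff1d(np.arange(len(Phi_train)), np.atleast_1d(idx))
    return (predictive_entropy(Phi_train[keep], phi_q, noise)
            - predictive_entropy(Phi_train, phi_q, noise))
```

Because GP posterior variance is non-increasing in the conditioning set, this score is non-negative: removing data can only raise predictive uncertainty at the query, and the highest-scoring examples are those whose removal raises it most.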