🤖 AI Summary
This work addresses the challenge of efficiently selecting a small, highly interpretable subset of training examples from massive datasets to elucidate the prediction behavior of language models. The authors propose a novel relevance scoring metric that quantifies the explanatory power of any training subset with respect to model outputs—without requiring model retraining—and leverage this metric to design an example selection strategy that jointly optimizes for influence and representativeness. Experimental results demonstrate that the proposed score effectively predicts whether a given subset supports or undermines specific model predictions. Moreover, the selection strategy significantly outperforms existing baselines, revealing that some commonly used approaches can perform worse than random selection.
📝 Abstract
Training data influence estimation methods quantify the contribution of training documents to a model's output, making them a promising source of information for example-based explanations. Because humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model's output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model's predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.
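The trade-off described above can be illustrated as a greedy subset selection that scores each candidate by a weighted combination of its influence and the marginal representativeness it adds. This is only a minimal sketch: the coverage objective, the `alpha` weight, and all function names are assumptions for illustration, not the paper's actual formulation.

```python
import math
import random

def select_examples(influence, embeddings, budget, alpha=0.5):
    """Hypothetical greedy selector balancing influence and representativeness.

    `influence[i]` is an influence score for training example i;
    `embeddings[i]` is its feature vector. Representativeness is modeled as
    facility-location-style coverage: how well the chosen set covers the
    whole training pool in cosine-similarity space. All details here are
    illustrative assumptions, not the paper's method.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    n = len(influence)
    sim = [[cos(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    selected = []
    coverage = [0.0] * n  # best similarity of each example to the chosen set
    for _ in range(budget):
        best, best_gain = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Marginal coverage gain if example i were added to the set.
            gain_repr = sum(max(coverage[j], sim[i][j]) for j in range(n)) - sum(coverage)
            gain = alpha * influence[i] + (1 - alpha) * gain_repr / n
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        coverage = [max(coverage[j], sim[best][j]) for j in range(n)]
    return selected

# Toy usage with synthetic influence scores and embeddings.
random.seed(0)
emb = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]
infl = [random.gauss(0, 1) for _ in range(30)]
picks = select_examples(infl, emb, budget=5)
print(picks)
```

Setting `alpha=1.0` reduces this to naive top-influence selection, which, per the abstract's finding, can waste the budget on redundant examples; lower `alpha` spreads the selection across the embedding space.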