🤖 AI Summary
Generative AI models suffer from limited output interpretability due to their black-box nature, posing trust and compliance risks, particularly in art and copyright-sensitive domains. To address this, we propose a search-driven data influence attribution method that traces generated outputs back to the training data they depend on, covering both raw samples and latent-space embeddings, and thereby enables output-oriented interpretability analysis. Unlike conventional gradient- or perturbation-based approaches, our method anchors attribution at the generation outcome and unifies influence assessment across original data and latent representations. It couples efficient search optimization with local retraining for rigorous validation, enabling precise identification of critical training subsets. Experiments demonstrate strong cross-model generalization and show that the method substantially improves the feasibility and reliability of expert-guided interpretability evaluation.
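To make the search step concrete, the sketch below ranks training samples by latent-space similarity to a generated output and returns the top candidates as a hypothesized influential subset. It is a minimal illustration under stated assumptions, not the paper's exact procedure: the `embed()` encoder (here a fixed random projection), the cosine-similarity ranking, and the pool size `k` are all placeholders for the model-specific components used in practice.

```python
import numpy as np

def embed(x: np.ndarray) -> np.ndarray:
    """Stand-in encoder: project a flattened sample into a 64-d latent space.
    A real pipeline would use the generative model's own encoder
    (hypothetical placeholder, not the paper's embedding)."""
    rng = np.random.default_rng(0)            # fixed projection for repeatability
    proj = rng.standard_normal((x.size, 64))
    return x.flatten() @ proj

def top_k_candidates(output: np.ndarray,
                     training_set: list[np.ndarray],
                     k: int = 10) -> list[int]:
    """Rank training samples by cosine similarity to the generated output in
    latent space and return the indices of the k most similar ones, i.e. the
    candidate influential subset handed to the retraining check."""
    z_out = embed(output)
    scores = []
    for x in training_set:
        z = embed(x)
        cos = float(z_out @ z /
                    (np.linalg.norm(z_out) * np.linalg.norm(z) + 1e-12))
        scores.append(cos)
    return [int(i) for i in np.argsort(scores)[::-1][:k]]
```

The same ranking can be run on raw samples instead of embeddings, which is how the method covers both views of the training data.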
📝 Abstract
Generative AI models offer powerful capabilities but often lack transparency, making it difficult to interpret their outputs. This is especially critical in cases involving artistic or copyrighted content. This work introduces a search-inspired approach that improves the interpretability of these models by analysing the influence of training data on their outputs. Our method provides observational interpretability by focusing on a model's output rather than on its internal state, and it considers both raw data and latent-space embeddings when searching for the influence of data items on generated content. We evaluate the method by retraining models locally and by demonstrating its ability to uncover influential subsets of the training data. This lays the groundwork for future extensions, including user-based evaluations with domain experts, which are expected to improve observational interpretability further.
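The retraining-based validation can be summarized in a few lines: retrain with and without the candidate subset and measure how far the output for a fixed prompt moves. The sketch below is a toy illustration under stated assumptions; `train()`, the `MeanModel` stand-in generator, and the Euclidean output distance are all hypothetical substitutes for the local retraining and model-appropriate comparison used in the actual evaluation.

```python
import numpy as np

class MeanModel:
    """Toy 'generative model': memorizes the training mean and returns it,
    ignoring the prompt (hypothetical stand-in for a real generator)."""
    def __init__(self, data: list[np.ndarray]):
        self.mean = np.mean(np.stack(data), axis=0)
    def generate(self, prompt) -> np.ndarray:
        return self.mean

def train(data: list[np.ndarray]) -> MeanModel:
    """Placeholder for local retraining of the generative model."""
    return MeanModel(data)

def influence_of_subset(training_set: list[np.ndarray],
                        subset_idx: list[int],
                        prompt) -> float:
    """Retrain with and without the candidate subset; a large output shift
    supports the hypothesis that the subset was influential."""
    full = train(training_set)
    reduced = [x for i, x in enumerate(training_set) if i not in set(subset_idx)]
    ablated = train(reduced)
    return float(np.linalg.norm(full.generate(prompt) - ablated.generate(prompt)))

# Usage: removing sample 0 shifts the output, so the score is nonzero.
data = [np.ones(4), np.zeros(4), np.full(4, 0.5)]
print(influence_of_subset(data, subset_idx=[0], prompt=None))  # 0.5
```

In the full method this check is what separates genuinely influential subsets from samples that are merely similar to the output.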