🤖 AI Summary
This work addresses the dual challenges of enhancing faithfulness and semantic sufficiency in feature attribution (FA) for large language models (LLMs). To this end, we propose Noiser—a novel FA method that injects bounded Gaussian noise into input token embeddings and quantifies local model robustness via perturbation sensitivity to generate faithful attributions. Crucially, we introduce *answerability*—a new evaluation dimension measuring whether an attribution-masked prompt can still recover the original model output—assessed automatically using an instruction-tuned discriminator. Our contributions include: (i) the first bounded embedding perturbation framework for FA, and (ii) a comprehensive multi-model, multi-task evaluation benchmark. Experiments span six mainstream LLMs across three NLP tasks; Noiser consistently outperforms gradient-based, attention-based, and existing perturbation-based methods on both faithfulness and answerability metrics.
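The core loop described above — inject bounded Gaussian noise into one token's embedding, measure how much the output distribution shifts, and use that sensitivity as the token's attribution score — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the model here is a stand-in linear-softmax head, and all function names, the noise bound, and the sample count are our own assumptions.

```python
# Hypothetical sketch of a Noiser-style attribution loop.
# The "model" is a toy: mean-pool token embeddings, project to logits, softmax.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_model(emb, W):
    # emb: (seq_len, d) token embeddings; W: (d, vocab) projection
    return softmax(emb.mean(axis=0) @ W)

def noiser_attribution(emb, W, sigma=0.1, bound=0.2, n_samples=32):
    """Score each token by how much bounded Gaussian noise on its
    embedding perturbs the model's output distribution (assumed metric:
    L1 distance between output distributions)."""
    base = toy_model(emb, W)
    scores = np.zeros(emb.shape[0])
    for i in range(emb.shape[0]):
        for _ in range(n_samples):
            noise = rng.normal(0.0, sigma, size=emb.shape[1])
            noise = np.clip(noise, -bound, bound)  # bounded perturbation
            noised = emb.copy()
            noised[i] = noised[i] + noise  # noise only token i
            scores[i] += np.abs(toy_model(noised, W) - base).sum()
    return scores / n_samples  # higher score = output more sensitive to this token

emb = rng.normal(size=(5, 8))   # 5 tokens, 8-dim embeddings
W = rng.normal(size=(8, 3))     # 3-class toy output
attr = noiser_attribution(emb, W)
print(attr.shape)               # one sensitivity score per input token
```

In a real LLM setting the toy model would be replaced by a forward pass over the actual embedding layer, and the distance would be taken over next-token distributions.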
📝 Abstract
Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that injects bounded noise into each input embedding and measures the robustness of the model against the partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instruction-tuned judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.
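The answerability metric above hinges on a simple construction: keep only the highest-scoring tokens, mask the rest, and ask a judge whether the masked prompt still suffices to recover the original output. A minimal sketch of that masking step follows; the judge here is a stub (the paper uses an instruction-tuned LLM), and the `keep_ratio` and `[MASK]` placeholder are our own illustrative assumptions.

```python
# Hypothetical sketch of the answerability construction.
def mask_prompt(tokens, scores, keep_ratio=0.4, mask_token="[MASK]"):
    """Keep the top-scoring fraction of tokens; mask the rest."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return " ".join(t if i in keep else mask_token
                    for i, t in enumerate(tokens))

def answerability(tokens, scores, target, judge, keep_ratio=0.4):
    """`judge(masked_prompt, target) -> bool` stands in for the
    instruction-tuned judge model deciding whether the masked prompt
    still suffices to recover `target`."""
    return judge(mask_prompt(tokens, scores, keep_ratio), target)

tokens = ["The", "capital", "of", "France", "is"]
scores = [0.1, 0.8, 0.1, 0.9, 0.2]
masked = mask_prompt(tokens, scores)
print(masked)  # → [MASK] capital [MASK] France [MASK]
```

A faithful attribution method should assign high scores to exactly the tokens whose retention lets the judge answer "yes".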