🤖 AI Summary
This work addresses the dual challenges of enhancing faithfulness and semantic sufficiency in feature attribution (FA) for large language models (LLMs). To this end, we propose Noiser—a novel FA method that injects bounded Gaussian noise into input token embeddings and quantifies local model robustness via perturbation sensitivity to generate faithful attributions. Crucially, we introduce *answerability*—a new evaluation dimension measuring whether an attribution-masked prompt can still recover the original model output—assessed automatically using an instruction-tuned discriminator. Our contributions include: (i) the first bounded embedding perturbation framework for FA, and (ii) a comprehensive multi-model, multi-task evaluation benchmark. Experiments span six mainstream LLMs across three NLP tasks; Noiser consistently outperforms gradient-based, attention-based, and existing perturbation-based methods on both faithfulness and answerability metrics.
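The core loop described above — inject bounded Gaussian noise into one token's embedding, measure how much the output distribution shifts, and use that sensitivity as the token's attribution score — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the model here is a stand-in linear-softmax head, and all function names, the noise bound, and the sample count are our own assumptions.

```python
# Hypothetical sketch of a Noiser-style attribution loop.
# The "model" is a toy: mean-pool token embeddings, project to logits, softmax.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_model(emb, W):
    # emb: (seq_len, d) token embeddings; W: (d, vocab) projection
    return softmax(emb.mean(axis=0) @ W)

def noiser_attribution(emb, W, sigma=0.1, bound=0.2, n_samples=32):
    """Score each token by how much bounded Gaussian noise on its
    embedding perturbs the model's output distribution (assumed metric:
    L1 distance between output distributions)."""
    base = toy_model(emb, W)
    scores = np.zeros(emb.shape[0])
    for i in range(emb.shape[0]):
        for _ in range(n_samples):
            noise = rng.normal(0.0, sigma, size=emb.shape[1])
            noise = np.clip(noise, -bound, bound)  # bounded perturbation
            noised = emb.copy()
            noised[i] = noised[i] + noise  # noise only token i
            scores[i] += np.abs(toy_model(noised, W) - base).sum()
    return scores / n_samples  # higher score = output more sensitive to this token

emb = rng.normal(size=(5, 8))   # 5 tokens, 8-dim embeddings
W = rng.normal(size=(8, 3))     # 3-class toy output
attr = noiser_attribution(emb, W)
print(attr.shape)               # one sensitivity score per input token
```

In a real LLM setting the toy model would be replaced by a forward pass over the actual embedding layer, and the distance would be taken over next-token distributions.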
📝 Abstract
Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that injects bounded noise into each input embedding and measures the robustness of the model against the partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instruction-tuned judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.
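The answerability metric above hinges on a simple construction: keep only the highest-scoring tokens, mask the rest, and ask a judge whether the masked prompt still suffices to recover the original output. A minimal sketch of that masking step follows; the judge here is a stub (the paper uses an instruction-tuned LLM), and the `keep_ratio` and `[MASK]` placeholder are our own illustrative assumptions.

```python
# Hypothetical sketch of the answerability construction.
def mask_prompt(tokens, scores, keep_ratio=0.4, mask_token="[MASK]"):
    """Keep the top-scoring fraction of tokens; mask the rest."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return " ".join(t if i in keep else mask_token
                    for i, t in enumerate(tokens))

def answerability(tokens, scores, target, judge, keep_ratio=0.4):
    """`judge(masked_prompt, target) -> bool` stands in for the
    instruction-tuned judge model deciding whether the masked prompt
    still suffices to recover `target`."""
    return judge(mask_prompt(tokens, scores, keep_ratio), target)

tokens = ["The", "capital", "of", "France", "is"]
scores = [0.1, 0.8, 0.1, 0.9, 0.2]
masked = mask_prompt(tokens, scores)
print(masked)  # → [MASK] capital [MASK] France [MASK]
```

A faithful attribution method should assign high scores to exactly the tokens whose retention lets the judge answer "yes".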