Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

📅 2024-08-21

🏛️ BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Existing faithfulness evaluation methods for autoregressive language models perturb the input, and the resulting distribution shift undermines their reliability. Method: We propose an evaluation framework based on controllable counterfactual generation, which uses causal intervention modeling and fluent counterfactual text generation to keep test instances within the model’s training distribution, so that faithfulness is assessed on in-distribution inputs. Contribution/Results: By combining counterfactual generation with attribution sensitivity analysis, the approach avoids biases inherent in conventional perturbation-based methods. Experiments on GPT-2, LLaMA, and other mainstream models reveal that the faithfulness of existing attribution techniques is systematically overestimated, whereas this framework yields more reliable estimates.

📝 Abstract

Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method—how accurately it explains the inner workings and decision-making of the model—is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model’s output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models. Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
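The erasure-style protocol described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_next_token_prob` is a hypothetical stand-in for an autoregressive model's probability of a target token, and the token weights and attribution scores are invented. It is not the paper's evaluation code.

```python
import math
from typing import Callable, Sequence

# Hypothetical per-token weights; a real model would be a neural LM.
TOY_WEIGHTS = {"great": 2.0, "loved": 1.5, "boring": -2.0, "not": -1.0}


def toy_next_token_prob(tokens: Sequence[str]) -> float:
    """Stand-in for P(target token | prompt): sigmoid of summed token weights."""
    logit = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def erasure_drop(
    tokens: Sequence[str],
    scores: Sequence[float],
    prob_fn: Callable[[Sequence[str]], float],
    k: int,
) -> float:
    """Delete the k highest-scored tokens and return the probability drop."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    removed = set(ranked[:k])
    kept = [t for i, t in enumerate(tokens) if i not in removed]
    return prob_fn(tokens) - prob_fn(kept)


tokens = ["the", "movie", "was", "great"]
good_scores = [0.0, 0.1, 0.0, 0.9]  # attribution that points at "great"
bad_scores = [0.9, 0.1, 0.0, 0.0]   # attribution that points at "the"

drop_good = erasure_drop(tokens, good_scores, toy_next_token_prob, k=1)
drop_bad = erasure_drop(tokens, bad_scores, toy_next_token_prob, k=1)
```

A larger drop suggests the attribution found tokens the model relies on. Note that the erased prompt ("the movie was") is a truncated fragment, which is the kind of input the abstract argues may be out of distribution for a model trained on next-token prediction.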

Problem

Research questions and friction points this paper is trying to address.

Evaluate faithfulness of attribution methods in autoregressive models.

Address out-of-distribution input issues in faithfulness evaluation.

Propose counterfactual generation for reliable attribution method evaluation.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses counterfactual generation for evaluation.

Focuses on autoregressive language models.

Generates fluent, in-distribution counterfactuals.
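One way to picture a counterfactual-based check is sketched below, under loud assumptions: the `SUBSTITUTES` table is a hypothetical stand-in for a learned counterfactual generator, `toy_positive_prob` is a toy classifier rather than a language model, and the "edits needed to flip the prediction" score is an illustrative choice, not necessarily the metric the paper uses.

```python
import math
from typing import Dict, Optional, Sequence

# Hypothetical token weights for a toy sentiment scorer.
WEIGHTS = {"great": 2.0, "loved": 1.5, "boring": -2.0, "hated": -1.5}

# Hypothetical stand-in for a counterfactual generator: each replacement
# keeps the sentence fluent instead of deleting or masking a token.
SUBSTITUTES: Dict[str, str] = {"great": "boring", "loved": "hated"}


def toy_positive_prob(tokens: Sequence[str]) -> float:
    """Sigmoid of summed token weights; stands in for a model's output."""
    logit = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def edits_needed_to_flip(
    tokens: Sequence[str],
    scores: Sequence[float],
    substitutes: Dict[str, str],
) -> Optional[int]:
    """Visit tokens in attribution order, substituting each one, and return
    how many ranked positions were visited before the prediction flipped.
    Positions without a substitute are left unchanged but still counted.
    Returns None if the prediction never flips."""
    original_label = toy_positive_prob(tokens) >= 0.5
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    edited = list(tokens)
    for n_visited, idx in enumerate(ranked, start=1):
        edited[idx] = substitutes.get(edited[idx], edited[idx])
        if (toy_positive_prob(edited) >= 0.5) != original_label:
            return n_visited
    return None


tokens = ["i", "loved", "this", "great", "movie"]
good_scores = [0.0, 0.6, 0.1, 0.9, 0.2]  # ranks "great" first
bad_scores = [0.9, 0.1, 0.8, 0.2, 0.7]   # ranks function words first

flips_good = edits_needed_to_flip(tokens, good_scores, SUBSTITUTES)
flips_bad = edits_needed_to_flip(tokens, bad_scores, SUBSTITUTES)
```

In this toy setup, an attribution that ranks the decisive token first flips the prediction sooner than one that ranks it late, while every intermediate input remains a well-formed sentence.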

🔎 Similar Papers

Whose LLM is it Anyway? Linguistic Comparison and LLM Attribution for GPT-3.5, GPT-4 and Bard