Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

📅 2026-04-15

🤖 AI Summary

This work addresses the frequent lack of epistemic faithfulness in explanations generated by large language models: the explanations often fail to reflect the internal evidence the model actually relied on for its decision. The authors first measure this gap with counterfactuals, then propose a training-free, attention-level intervention that, for the first time, incorporates token-level attribution heatmaps into the explanation generation process. By steering the model's attention toward the tokens that the attribution method marks as key evidence, the approach improves the epistemic faithfulness of generated explanations. Applied to off-the-shelf large language models and evaluated with counterfactuals across multiple models, benchmarks, and prompting configurations, the method significantly improves faithfulness, narrowing the gap between explanations that merely sound plausible and explanations that reflect the model's actual evidence.
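To make the idea of steering attention with a heatmap concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the `strength` parameter, and the choice to add a log-prior bias to pre-softmax attention logits are all assumptions made for illustration. The sketch only shows one plausible way a token-level attribution heatmap could reweight a single attention row without any training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_attention(logits, heatmap, strength=1.0, eps=1e-6):
    """Bias one row of attention logits toward high-attribution tokens.

    Hypothetical scheme (an assumption, not the paper's method):
    add strength * log(h_i + eps) to each logit, where h is the
    heatmap clipped at zero and normalized to sum to 1. This is
    equivalent to multiplying the original attention weights by
    (h_i + eps) ** strength and renormalizing.
    """
    if len(logits) != len(heatmap):
        raise ValueError("logits and heatmap must have the same length")
    clipped = [max(h, 0.0) for h in heatmap]
    total = sum(clipped)
    if total == 0.0:
        # No positive attribution: leave attention unchanged.
        return softmax(logits)
    bias = [strength * math.log(h / total + eps) for h in clipped]
    return softmax([l + b for l, b in zip(logits, bias)])

# Toy usage: three input tokens, attribution concentrated on the second.
base = softmax([0.0, 0.0, 0.0])
steered = steer_attention([0.0, 0.0, 0.0], [0.1, 0.8, 0.1])
```

With `strength=0.0` the bias vanishes and the original attention distribution is recovered, which gives a simple knob for how hard the heatmap pushes.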

📝 Abstract

Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
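The abstract does not spell out the counterfactual protocol, so the following is only a sketch of one common deletion-style check, under stated assumptions: an explanation is reduced to the set of input token positions it cites, those tokens are masked, and the explanation counts as supported if the model's decision changes. The names `counterfactual_flip_rate`, `toy_predict`, and the `[MASK]` token are hypothetical, and the keyword classifier stands in for a real LLM.

```python
def counterfactual_flip_rate(predict, examples, mask_token="[MASK]"):
    """Fraction of examples whose prediction changes when the tokens
    cited by the explanation are masked.

    predict:  callable mapping a list of tokens to a label.
    examples: list of (tokens, cited_indices) pairs, where cited_indices
              is the set of positions the explanation points to.
    """
    if not examples:
        return 0.0
    flips = 0
    for tokens, cited in examples:
        original = predict(tokens)
        edited = [mask_token if i in cited else t
                  for i, t in enumerate(tokens)]
        if predict(edited) != original:
            flips += 1
    return flips / len(examples)

def toy_predict(tokens):
    """Stand-in classifier (assumption): positive iff 'great' appears."""
    return "positive" if "great" in tokens else "negative"

sentence = ["the", "movie", "was", "great"]
examples = [
    (sentence, {3}),  # explanation cites "great": masking flips the label
    (sentence, {1}),  # explanation cites "movie": masking changes nothing
]
rate = counterfactual_flip_rate(toy_predict, examples)
```

Under this toy setup, a higher flip rate means the cited tokens more often carry the decision; an explanation that cites tokens the model ignores scores low even if it reads convincingly.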

Problem

Research questions and friction points this paper is trying to address.

faithfulness

explanation

large language models

attribution

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

epistemic faithfulness

attribution guidance

attention intervention