Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

2026-04-15

AI Summary
This work addresses a common failure of explanations generated by large language models: they often lack epistemic faithfulness, meaning they do not accurately reflect the internal evidence behind the model's decision. To bridge this gap, the authors propose a training-free, attention-level intervention that, for the first time, incorporates token-level attribution heatmaps into the explanation generation process. By steering the model's attention toward the key evidential tokens, the approach makes the generated explanations more faithful. Evaluated with counterfactuals across multiple models, benchmarks, and prompting configurations, the method significantly improves the faithfulness of explanations produced by off-the-shelf large language models, narrowing the gap between subjective plausibility and epistemic faithfulness.
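
As a rough illustration of what an attention-level intervention guided by an attribution heatmap could look like, the sketch below biases one query's attention logits toward highly attributed input tokens. It is not the authors' implementation: the function name `steer_attention`, the `strength` parameter, and the choice of a log-prior bias are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def steer_attention(logits, attribution, strength=1.0, eps=1e-9):
    """Bias pre-softmax attention logits toward highly attributed tokens.

    logits:      raw attention scores of one query over the input tokens.
    attribution: non-negative relevance score per input token (the heatmap).
    strength:    hypothetical knob; 0 leaves the attention unchanged.
    """
    if len(logits) != len(attribution):
        raise ValueError("logits and attribution must have the same length")
    total = sum(attribution)
    if total <= 0:
        # No usable heatmap: fall back to ordinary attention.
        return softmax(logits)
    prior = [a / total for a in attribution]
    # Adding strength * log(prior) multiplies each attention weight by
    # prior ** strength before renormalisation.
    biased = [l + strength * math.log(p + eps) for l, p in zip(logits, prior)]
    return softmax(biased)


# Uniform attention over three tokens, heatmap concentrated on the second one.
weights = steer_attention([0.0, 0.0, 0.0], [0.1, 0.8, 0.1])
unchanged = steer_attention([0.5, 1.5, -0.2], [0.1, 0.8, 0.1], strength=0.0)
```

Because the bias is applied only at inference time, a scheme of this kind needs no fine-tuning, which matches the training-free setting described above. Which layers, heads, and decoding steps to steer is left open here.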

Abstract
Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
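
To make the idea of a counterfactual faithfulness check concrete, here is a minimal sketch under strong simplifying assumptions. The abstract does not specify the evaluation protocol, so the deletion-based test, the `counterfactual_flip` helper, and the keyword-based `toy_classifier` standing in for an LLM are all hypothetical.

```python
def toy_classifier(tokens):
    """Stand-in for an LLM decision: 'pos' if positive cues outnumber negative ones."""
    positive = {"great", "good"}
    negative = {"bad", "awful"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "pos" if score > 0 else "neg"


def counterfactual_flip(predict, tokens, cited):
    """Return True if deleting the tokens an explanation cites changes the decision."""
    original = predict(tokens)
    edited = [t for t in tokens if t not in cited]
    return predict(edited) != original


tokens = "the plot was great but the pacing dragged".split()

# An explanation citing "great" names evidence the decision depends on.
faithful = counterfactual_flip(toy_classifier, tokens, {"great"})

# An explanation citing "pacing" sounds plausible but names irrelevant evidence.
unfaithful = counterfactual_flip(toy_classifier, tokens, {"pacing"})
```

In this toy setting, an explanation counts as faithful only if removing the evidence it cites flips the decision. A plausible-sounding explanation that cites tokens the model ignored fails the check.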
Problem

Research questions and friction points this paper is trying to address.

faithfulness
explanation
large language models
attribution
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

epistemic faithfulness
attribution guidance
attention intervention
post-hoc explanation
faithful explanation