AI Summary
This work addresses the frequent lack of epistemic faithfulness in explanations generated by large language models, namely their failure to accurately reflect the internal evidence underlying model decisions. To bridge this gap, the authors propose a training-free, attention-level intervention that, for the first time, incorporates token-level attribution heatmaps into the explanation generation process. By steering the model's attention toward key evidential tokens, the approach makes generated explanations more epistemically faithful. Evaluated with counterfactuals across multiple models, benchmarks, and prompting configurations, the method significantly improves the faithfulness of explanations produced by off-the-shelf large language models, narrowing the gap between explanations that merely sound convincing and explanations that reflect the evidence the model actually relied on.
Abstract
Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
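The abstract does not spell out how the attention-level intervention is implemented, so the following is only a minimal sketch of one generic way a token-level heatmap could bias attention, not the authors' method. It assumes a single attention row (one query over the input tokens), a non-negative heatmap with one score per token, and a hypothetical `strength` parameter; the function name `steer_attention` is likewise invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_attention(logits, heatmap, strength=1.0, eps=1e-6):
    """Bias one attention row toward tokens with high attribution.

    logits:   pre-softmax attention scores of one query over the input tokens
    heatmap:  non-negative attribution scores, one per input token (assumed input)
    strength: hypothetical knob; 0 leaves attention unchanged, larger steers harder
    """
    if len(logits) != len(heatmap):
        raise ValueError("logits and heatmap must have the same length")
    total = sum(heatmap)
    if total <= 0:
        # No usable attribution signal: fall back to ordinary attention.
        return softmax(logits)
    normalized = [h / total for h in heatmap]
    # Adding a log-prior to the logits multiplies each attention weight
    # by (normalized attribution + eps) ** strength before renormalizing.
    biased = [
        score + strength * math.log(p + eps)
        for score, p in zip(logits, normalized)
    ]
    return softmax(biased)


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0, 0.0]    # uniform attention before steering
    heatmap = [0.1, 0.7, 0.1, 0.1]   # token 1 carries most of the evidence
    print(steer_attention(logits, heatmap))
```

Because the bias is added before the softmax, the steered weights still sum to one and no model parameters change, which is consistent with the training-free framing above. Which layers and heads to intervene on, and how strongly, are left open here.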