Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

📅 2025-04-19
📈 Citations: 2
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate explanations lacking faithfulness—i.e., concepts cited as decisive in the explanation often lack genuine causal influence on model predictions, risking misplaced trust and misuse. To address this, we propose the first quantitative framework for explanation faithfulness grounded in causal concept intervention. We formally define faithfulness as the alignment between concepts highlighted in an explanation and their empirically estimated causal effects on model outputs. Our method leverages an auxiliary LLM to generate semantically coherent counterfactual inputs and employs Bayesian hierarchical modeling to estimate concept-level causal effects robustly across samples and datasets. Evaluated on social bias detection and medical question-answering tasks, our approach successfully identifies prevalent unfaithful patterns—including “explanation-masking bias” and “fabricated-evidence dependence”—and delivers an interpretable, quantifiable diagnostic tool for faithfulness assessment.

📝 Abstract
Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.
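The definition above can be sketched in a few lines: estimate each concept's causal effect by checking how often counterfactual edits to that concept flip the prediction, then compare the causally influential concepts against those the explanation cites. This is a simplified illustration, not the paper's implementation; `model_predict`, the counterfactual inputs, and the 0.5 threshold are all hypothetical stand-ins.

```python
def causal_effect(model_predict, original, counterfactuals):
    """Estimate a concept's causal effect as the fraction of counterfactual
    edits to that concept that change the model's prediction."""
    base = model_predict(original)
    flips = sum(model_predict(cf) != base for cf in counterfactuals)
    return flips / len(counterfactuals)

def unfaithfulness(implied_concepts, effects, threshold=0.5):
    """Split the discrepancy into the two patterns the abstract describes:
    influential concepts the explanation hides, and cited concepts that
    are actually causally inert."""
    influential = {c for c, e in effects.items() if e >= threshold}
    hidden = influential - implied_concepts      # e.g. concealed social bias
    fabricated = implied_concepts - influential  # cited but inert evidence
    return hidden, fabricated
```

With a toy classifier that secretly keys on gender while its explanation cites only experience, `unfaithfulness({"experience"}, {"gender": 1.0, "experience": 0.0})` would surface `gender` as hidden and `experience` as fabricated.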
Problem

Research questions and friction points this paper is trying to address.

Measuring faithfulness of LLM explanations to prevent over-trust
Defining faithfulness via influential concept discrepancy in explanations
Detecting misleading claims in LLM explanations on bias and evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defining faithfulness via concept influence difference
Creating counterfactuals with auxiliary LLM modifications
Quantifying causal effects via Bayesian hierarchical model
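The hierarchical-modeling idea in the last point can be illustrated with a much simpler partial-pooling estimator: per-example effect estimates are shrunk toward the dataset-level mean, with examples that have fewer counterfactual samples shrunk more. This is a crude stand-in for the paper's Bayesian hierarchical model, and `prior_strength` is an assumed pseudo-count parameter, not from the paper.

```python
import statistics

def partial_pool(example_effects, prior_strength=5.0):
    """Shrink noisy per-example concept-effect estimates toward the
    pooled dataset mean. Each entry is (effect_estimate, n_counterfactuals);
    smaller n means the dataset-level mean dominates."""
    pooled = statistics.fmean(e for e, _ in example_effects)
    return [
        (n * e + prior_strength * pooled) / (n + prior_strength)
        for e, n in example_effects
    ]
```

For instance, two examples with raw effects 1.0 and 0.0 (five counterfactuals each) pool to 0.75 and 0.25, reflecting that extreme per-example estimates from few samples are likely overstated.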