🤖 AI Summary
Post-hoc explanations of large language models (LLMs) frequently misalign with the models' actual decision-making logic, yielding distorted feature attributions; existing evaluation methods suffer from high computational cost and limited scalability to large datasets. Method: We introduce PSCB, the first scalable, large-scale post-hoc self-consistency benchmark, to systematically expose pervasive explanation-decision misalignment. We propose a decoupled self-consistency metric that disentangles explanation fidelity from mere answer consistency, overcoming the limitations of the conventional self-consistency score. We further design a DPO-based fine-tuning method that aligns explanations with the underlying decision-relevant features. Contribution/Results: Experiments show substantial improvements in alignment between explanations and critical features across diverse tasks, with strong cross-domain robustness. The framework offers a scalable, rigorous way to evaluate and refine post-hoc interpretability, a step toward trustworthy, faithful LLM explanations.
📝 Abstract
Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly because estimating feature importance is expensive, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions, and that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.
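To make the core idea concrete, here is a minimal sketch of the kind of explanation–feature alignment check the abstract describes: compare the tokens an explanation cites against the top-k features by importance. This is an illustrative toy, not PSCB's actual metric; the function name, the example importance scores, and the cited-token list are all hypothetical.

```python
def alignment_score(explanation_tokens, feature_importance, k=5):
    """Fraction of the top-k most important features that the
    explanation actually mentions (illustrative, not PSCB's metric)."""
    # Rank features by their (hypothetical) importance scores.
    top_k = sorted(feature_importance, key=feature_importance.get,
                   reverse=True)[:k]
    cited = {t.lower() for t in explanation_tokens}
    overlap = sum(1 for f in top_k if f.lower() in cited)
    return overlap / k

# Hypothetical importance scores for input tokens of one decision.
importance = {"refund": 0.9, "delay": 0.7, "apology": 0.4,
              "price": 0.2, "the": 0.05}

# The explanation cites "refund", "delay", and "tone"; only the
# first two are among the top-3 features, so the score is 2/3.
score = alignment_score(["refund", "delay", "tone"], importance, k=3)
```

A faithful explanation would score high here; an explanation that sounds plausible but cites unimportant tokens would score low, which is the kind of gap the benchmark is designed to surface.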