🤖 AI Summary
This work investigates the faithfulness of self-generated explanations produced by large language models (LLMs) during commonsense reasoning—a critical issue for model interpretability and safety oversight. We propose phi-CCT, a simplified variant of the Correlational Counterfactual Test that dispenses with token probabilities, and conduct a systematic evaluation across 62 models from 8 families. Our study establishes a clear positive relationship between model scale and explanation faithfulness: larger models are consistently more faithful on our metrics. We find that instruction tuning mainly shifts models along the true-positive/false-positive trade-off rather than expanding the Pareto frontier of faithfulness beyond what pretrained models of comparable size achieve. Crucially, much of the apparent difference in faithfulness between explanation types can be attributed to explanation verbosity. To enable scalable, cross-model-family assessment, we design a standardized, low-overhead, and reproducible evaluation framework. Our results suggest a key direction for improving faithfulness: scaling model size yields greater gains than post-training optimization alone.
📝 Abstract
As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.
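The abstract describes phi-CCT only at a high level: a counterfactual test that works from binary outcomes rather than token probabilities. The sketch below illustrates one plausible reading of that idea, not the paper's actual protocol: insert a counterfactual edit into the input, record whether the model's prediction flips and whether its explanation mentions the edit, then correlate the two binary outcomes with the phi coefficient. The function name, the toy data, and the exact pairing of outcomes are all assumptions for illustration.

```python
# Illustrative sketch of a phi-coefficient-based counterfactual faithfulness
# score in the spirit of phi-CCT. The protocol shown here is an assumption,
# not the paper's definition: correlate "counterfactual edit flipped the
# prediction" with "explanation mentions the edit" over many examples.
from math import sqrt

def phi_coefficient(xs, ys):
    """Phi (Matthews) correlation between two equal-length binary sequences."""
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

# Toy data (not from the paper): per example, did the counterfactual edit
# flip the model's prediction, and did the explanation mention the edit?
flipped   = [1, 1, 0, 0, 1, 0, 1, 0]
mentioned = [1, 1, 0, 0, 1, 0, 0, 1]
print(round(phi_coefficient(flipped, mentioned), 3))  # → 0.5
```

A phi near 1 would indicate that the explanation mentions an edit exactly when the edit actually drives the prediction, i.e. a faithful explanation; a phi near 0 would indicate that mentions are uninformative about the model's true decision process.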