🤖 AI Summary
This work investigates the faithfulness of self-generated explanations produced by large language models (LLMs) during commonsense reasoning—a critical issue for model interpretability and safety oversight. We propose phi-CCT, a simplified variant of the Correlational Counterfactual Test that dispenses with token probabilities, and conduct a systematic evaluation across 62 models from 8 families. Our study establishes a clear positive relationship between model scale and explanation faithfulness: larger models are consistently more faithful on our metrics. We find that instruction tuning mainly shifts models along the true-positive/false-positive trade-off rather than expanding the Pareto frontier of faithfulness beyond what pretrained models of comparable size achieve. Crucially, much of the apparent difference in faithfulness between explanation types can be attributed to explanation verbosity. To enable scalable, cross-model-family assessment, we design a standardized, low-overhead, and reproducible evaluation framework. Our results suggest a key direction for improving faithfulness: scaling model size yields greater gains than post-training optimization alone.
📝 Abstract
As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.
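The abstract describes phi-CCT only at a high level: a counterfactual test that works from binary outcomes rather than token probabilities. The sketch below illustrates one plausible reading of that idea, not the paper's actual protocol: insert a counterfactual edit into the input, record whether the model's prediction flips and whether its explanation mentions the edit, then correlate the two binary outcomes with the phi coefficient. The function name, the toy data, and the exact pairing of outcomes are all assumptions for illustration.

```python
# Illustrative sketch of a phi-coefficient-based counterfactual faithfulness
# score in the spirit of phi-CCT. The protocol shown here is an assumption,
# not the paper's definition: correlate "counterfactual edit flipped the
# prediction" with "explanation mentions the edit" over many examples.
from math import sqrt

def phi_coefficient(xs, ys):
    """Phi (Matthews) correlation between two equal-length binary sequences."""
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

# Toy data (not from the paper): per example, did the counterfactual edit
# flip the model's prediction, and did the explanation mention the edit?
flipped   = [1, 1, 0, 0, 1, 0, 1, 0]
mentioned = [1, 1, 0, 0, 1, 0, 0, 1]
print(round(phi_coefficient(flipped, mentioned), 3))  # → 0.5
```

A phi near 1 would indicate that the explanation mentions an edit exactly when the edit actually drives the prediction, i.e. a faithful explanation; a phi near 0 would indicate that mentions are uninformative about the model's true decision process.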