A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of evaluating the faithfulness of self-explanations generated by large language models, which existing methods struggle to assess due to the absence of metrics aligned with the model’s actual reasoning process. The authors propose a novel evaluation paradigm centered on predictive utility: an explanation is deemed faithful if it enables humans to more accurately predict the model’s behavior on counterfactual inputs. To operationalize this, they introduce Normalized Simulatability Gain (NSG), a metric that quantifies the predictive value of explanations through simulatability gains, circumventing limitations of traditional approaches reliant on adversarial or error-detection paradigms. Large-scale experiments across 7,000 counterfactual samples in health, business, and ethics domains, involving 18 state-of-the-art models, demonstrate that self-explanations significantly improve human prediction accuracy (NSG of 11–37%), outperform externally generated explanations, and generally provide positive predictive utility—though 5–15% of self-explanations are found to be substantially misleading.

📝 Abstract
LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.
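The abstract does not spell out the NSG formula, but the idea of a normalized simulatability gain can be sketched. The snippet below is an illustrative assumption, not the authors' definition: it measures an observer's accuracy at predicting the model's behavior on counterfactuals with and without the explanation, and normalizes the gain by the headroom remaining above the no-explanation baseline.

```python
# Illustrative sketch of a Normalized Simulatability Gain (NSG) style metric.
# NOTE: the paper's exact formula is not given in this abstract; normalizing by
# the remaining headroom (1 - baseline accuracy) is an assumption.

def simulatability_accuracy(predictions, model_outputs):
    """Fraction of counterfactual inputs where the observer's prediction
    matches the model's actual behavior."""
    assert len(predictions) == len(model_outputs)
    matches = sum(p == y for p, y in zip(predictions, model_outputs))
    return matches / len(model_outputs)

def normalized_simulatability_gain(acc_with_explanation, acc_baseline):
    """Gain in simulatability from seeing the explanation, normalized by the
    maximum possible gain over the no-explanation baseline."""
    headroom = 1.0 - acc_baseline
    if headroom == 0.0:
        return 0.0  # baseline already perfect; no gain is possible
    return (acc_with_explanation - acc_baseline) / headroom

# Example: the observer predicts the model correctly on 60% of counterfactuals
# without the explanation and 70% with it -> NSG = (0.70 - 0.60) / 0.40 = 0.25.
print(normalized_simulatability_gain(0.70, 0.60))
```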
Problem

Research questions and friction points this paper is trying to address.

faithfulness
self-explanations
model behavior prediction
explainability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-explanation
faithfulness
Normalized Simulatability Gain
model interpretability
predictive utility
Harry Mayne
University of Oxford
J. S. Kang
University of California, Berkeley
Dewi Gould
Independent
K. Ramchandran
University of California, Berkeley
Adam Mahdi
Associate Professor, University of Oxford
large language models, multimodal AI, digital health
Noah Y. Siegel
Google DeepMind
AI Alignment, Large Language Models, Scalable Oversight, Reinforcement Learning, Robotics