Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether self-explanations generated by large language models (LLMs) can improve the accuracy of human and LLM predictions of the models' behavior in counterfactual scenarios. To this end, the authors integrate pragmatic perturbations into a counterfactual simulatability framework to construct test cases, and conduct a systematic evaluation using chain-of-thought and post-hoc explanation generation, joint human–LLM assessments, and qualitative analysis of free-text responses. The findings show that self-explanations consistently improve prediction accuracy, though the size and stability of the gains depend on the choice of perturbation strategy and the evaluators' reasoning capabilities. Further analysis of user-written rationales corroborates the constructive role of explanations in shaping human judgment.

📝 Abstract
Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model's true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions, with and without access to the model's chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model's behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.
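For context, here is a minimal sketch (in Python, not the authors' released code) of how the counterfactual simulatability protocol described in the abstract can be operationalized: a judge, either a human annotator or an LLM, predicts the model's answer to a perturbed follow-up question, with or without access to the model's explanation, and simulation accuracy is the fraction of predictions that match the model's actual answers. All class and function names below are hypothetical.

```python
# Hypothetical sketch of counterfactual simulatability evaluation.
# Names (SimulationExample, judge_predict) are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class SimulationExample:
    original_question: str        # e.g. a StrategyQA question answered by the model
    explanation: Optional[str]    # model's chain-of-thought / post-hoc explanation, or None
    counterfactual_question: str  # LLM-generated or pragmatics-based perturbation
    model_answer: bool            # the model's actual answer to the perturbed question


def simulation_accuracy(
    examples: Iterable[SimulationExample],
    judge_predict: Callable[[str, Optional[str], str], bool],
) -> float:
    """Fraction of counterfactual questions on which the judge correctly
    predicts the model's actual answer."""
    examples = list(examples)
    correct = sum(
        judge_predict(ex.original_question, ex.explanation, ex.counterfactual_question)
        == ex.model_answer
        for ex in examples
    )
    return correct / len(examples)
```

The explanation effect studied in the paper then corresponds to comparing this accuracy when the judge sees the model's explanation against a baseline where `explanation` is withheld (set to `None`).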
Problem

Research questions and friction points this paper is trying to address.

self-explanations
counterfactual simulatability
model behavior prediction
large language models
pragmatic perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual simulatability
self-explanations
pragmatic perturbations
chain-of-thought
model interpretability
Pingjun Hong
Faculty of Computer Science, University of Vienna, Vienna, Austria
Benjamin Roth
University of Vienna
Natural Language Processing
Machine Learning