🤖 AI Summary
Current large language models (LLMs) lack a systematic evaluation of their ability to produce self-generated counterfactual explanations (SCEs) that are both effective and logically consistent with their own predictions.
Method: This work formally defines dual criteria—SCE effectiveness and intrinsic consistency—and introduces a prompt-engineering–based self-explanation paradigm. It conducts controlled experiments across model families, parameter scales, temperature settings, and datasets, and proposes a two-dimensional evaluation framework.
Results: Empirical analysis reveals severe instability in SCE generation: over 50% of SCEs produced by mainstream LLMs contradict the models’ own predictions, exposing a critical explanation–prediction inconsistency. This finding identifies a fundamental trustworthiness bottleneck in LLM self-explanation and establishes a novel benchmark for explainable AI, offering concrete directions for improving explanatory fidelity and coherence.
📝 Abstract
Explanations are an important tool for gaining insights into the behavior of ML models, calibrating user trust, and ensuring regulatory compliance. The past few years have seen a flurry of post-hoc methods for generating model explanations, many of which involve computing model gradients or solving specially designed optimization problems. However, owing to the remarkable reasoning abilities of Large Language Models (LLMs), self-explanation, that is, prompting the model to explain its own outputs, has recently emerged as a new paradigm. In this work, we study a specific type of self-explanation: self-generated counterfactual explanations (SCEs). We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, model sizes, temperature settings, and datasets reveals that LLMs sometimes struggle to generate SCEs. Even when they do, their predictions often disagree with their own counterfactual reasoning.
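The consistency test described in the abstract can be sketched as a simple loop: get the model's prediction, prompt it for a counterfactual edit, then re-query the model on that counterfactual and check whether its own prediction actually flips. The sketch below is a minimal illustration of that protocol, not the paper's implementation; `query_model` and `counterfactual_edit` are hypothetical stand-ins for real LLM calls (here replaced by toy keyword rules so the example is self-contained and runnable).

```python
def query_model(prompt: str) -> str:
    """Toy stand-in for an LLM classifier call (hypothetical):
    labels a review 'positive' iff it contains the word 'good'."""
    return "positive" if "good" in prompt.lower() else "negative"

def counterfactual_edit(text: str) -> str:
    """Toy stand-in for prompting the model for an SCE:
    simulates a minimal edit intended to flip the label."""
    return text.replace("good", "bad")

def sce_is_consistent(text: str) -> bool:
    """An SCE is consistent with the model's own reasoning if the
    model's prediction on the counterfactual differs from its
    prediction on the original input."""
    original_label = query_model(text)
    counterfactual = counterfactual_edit(text)
    revised_label = query_model(counterfactual)
    return revised_label != original_label

print(sce_is_consistent("The movie was good."))  # True: the label flips
```

The paper's central finding corresponds to the case where this check returns `False` at scale: the model proposes a counterfactual, but when shown its own counterfactual, it predicts the original label anyway.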