LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

📅 2025-09-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes the systematic unreliability of self-generated counterfactual explanations (SCEs) from large language models (LLMs). When prompted for counterfactuals, models typically produce SCEs that are predictively valid but far from minimal, introducing redundant perturbations; conversely, when explicitly instructed to produce "minimal" counterfactuals, they frequently under-perturb and fail to flip their predictions. The study provides the first rigorous empirical demonstration of an inherent trade-off between validity and minimality in SCEs, challenging their suitability as a foundation for explainable AI. Through controlled experiments across diverse LLMs (e.g., Llama and GPT-series models) and benchmark datasets, the authors use both automated metrics and human evaluation to show that SCEs systematically misrepresent model decision boundaries, posing substantial risks in high-stakes applications. The findings underscore critical limitations of SCEs as a tool for faithful model introspection. Code is publicly released.

📝 Abstract
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
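The two evaluation criteria from the abstract can be made concrete with a small sketch. This is not the paper's released code (see the linked repository for that); `predict` is a hypothetical stand-in for querying an LLM classifier, and the minimality proxy here is a simple token-level edit fraction.

```python
# Illustrative sketch of checking an SCE's validity (does the prediction
# flip?) and a token-level minimality proxy (what fraction of tokens changed?).
import difflib

def predict(text: str) -> str:
    # Hypothetical toy classifier; in practice this would be an LLM call.
    return "positive" if "good" in text else "negative"

def sce_validity(original: str, counterfactual: str) -> bool:
    """An SCE is valid if the model's prediction changes on the edited input."""
    return predict(original) != predict(counterfactual)

def sce_minimality(original: str, counterfactual: str) -> float:
    """Fraction of tokens changed (lower = more minimal), via difflib matching."""
    a, b = original.split(), counterfactual.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(a), len(b))

original = "The food was good and the service friendly"
cf = "The food was bad and the service friendly"
print(sce_validity(original, cf))    # True: the prediction flips
print(sce_minimality(original, cf))  # 0.125: one of eight tokens changed
```

The paper's finding, in these terms, is that freely generated counterfactuals score well on `sce_validity` but poorly on `sce_minimality`, while "minimal" counterfactuals show the reverse pattern.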
Problem

Research questions and friction points this paper is trying to address.

Evaluating the validity and minimality of self-generated counterfactual explanations (SCEs)
Assessing whether LLM self-explanations reliably reflect model decision-making
Characterising the trade-off between explanation validity and minimal input modification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of SCE validity across models and datasets
Testing whether minimal input edits actually change model predictions
Demonstrating a consistent validity-minimality trade-off across LLMs