Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether high-quality natural language explanations (NLEs) generated by large language models improve downstream decision-making or merely induce illusory trust. Across five controlled experiments in a time-series energy forecasting setting, the authors evaluate the effect of such explanations on task accuracy and reliability judgments, using a placebic control and an out-of-distribution detection task while holding explanation quality constant at the high levels established by a prior factorial study. They characterise a “Quality-Usefulness Gap” in explainable AI (XAI): explanations with high textual quality do not necessarily support effective decisions; instead, their mere presence inflates self-reported confidence, and in the out-of-distribution task they reduce the LLM judge's ability to flag unreliable predictions. The explanations do not improve accuracy on any of the five tasks and can provide false reassurance that masks model failure.

📝 Abstract

Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.
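
As a rough illustration of the comparison the abstract describes, the sketch below computes per-condition accuracy and mean self-reported confidence, then the difference of each against a no-text baseline. It is a minimal sketch under stated assumptions and not the authors' analysis code: the condition names (`no_text`, `placebo`, `nle`), the record layout, the helper names `summarise` and `gap`, and all numbers are hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy judgments, invented for illustration only (NOT the paper's data).
# Each record: (condition, judged_correctly, self_reported_confidence in [0, 1]).
judgments = [
    ("no_text", True, 0.6), ("no_text", False, 0.5),
    ("no_text", True, 0.6), ("no_text", False, 0.5),
    ("placebo", True, 0.8), ("placebo", False, 0.9),
    ("placebo", False, 0.8), ("placebo", True, 0.9),
    ("nle", True, 0.8), ("nle", False, 0.8),
    ("nle", True, 0.9), ("nle", False, 0.9),
]


def summarise(records):
    """Return {condition: (accuracy, mean_confidence)}."""
    grouped = defaultdict(list)
    for condition, correct, confidence in records:
        grouped[condition].append((correct, confidence))
    return {
        condition: (
            mean(1.0 if correct else 0.0 for correct, _ in rows),
            mean(confidence for _, confidence in rows),
        )
        for condition, rows in grouped.items()
    }


def gap(summary, treatment, baseline="no_text"):
    """Return (accuracy difference, confidence difference) versus the baseline."""
    acc_t, conf_t = summary[treatment]
    acc_b, conf_b = summary[baseline]
    return acc_t - acc_b, conf_t - conf_b


summary = summarise(judgments)
for condition in ("placebo", "nle"):
    d_acc, d_conf = gap(summary, condition)
    print(f"{condition}: accuracy change {d_acc:+.2f}, confidence change {d_conf:+.2f}")
```

In this toy data the accuracy difference is zero while confidence rises by a similar amount for both the placebic text and the NLE, which is the pattern the abstract attributes to text presence rather than content. A real analysis would also need uncertainty estimates and significance tests, which this sketch omits.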

Problem

Research questions and friction points this paper is trying to address.

Explainable AI

Natural Language Explanations

Large Language Models

Explanation Quality

Decision Usefulness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-Usefulness Gap

Natural Language Explanations

Explainable AI