🤖 AI Summary
This study investigates whether high-quality natural language explanations (NLEs) generated by large language models improve decision-making performance on downstream tasks or merely induce unwarranted trust. Through five controlled experiments in a time-series energy forecasting setting (2,730 judgments across 60 test instances), the authors evaluate the effect of such explanations on task accuracy, self-reported confidence, and reliability judgments, using a placebic control and an out-of-distribution detection task while holding explanation quality constant at the high levels established by a prior factorial study. They characterise a “Quality-Usefulness Gap” in explainable AI (XAI): explanations that score highly on text-quality metrics do not necessarily support effective decision-making. The explanations do not improve accuracy on any of the five tasks; their mere presence, rather than their content, inflates confidence, and in the out-of-distribution task they reduce the LLM judge's ability to flag unreliable predictions.
📝 Abstract
Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.