🤖 AI Summary
This study addresses the vulnerability of AI-generated natural-language explanations to malicious manipulation, which can induce excessive user trust in erroneous model outputs and thereby compromise the safety of human-AI collaborative decision-making. The work is the first to conceptualize AI explanations as an adversarial cognitive channel, introducing "Adversarial Explanation Attacks" (AEAs), a framework that systematically manipulates four dimensions of explanations: reasoning patterns, evidence types, communication styles, and presentation formats. Through a controlled experiment with 205 participants, the research demonstrates that AEAs can elevate trust in incorrect outputs to levels nearly matching trust in correct outputs, particularly on high-difficulty, fact-driven tasks. The effect is most pronounced among younger participants, those with less formal education, and users with high baseline trust in AI. The study further proposes a "trust miscalibration gap" metric to quantify this risk, revealing how explanatory content can be weaponized to manipulate human judgment.
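For intuition, here is one plausible formalization of that gap; the notation is ours, not taken from the paper. Let $\mathcal{C}$ and $\mathcal{I}$ be the sets of trials whose model output is correct and incorrect, respectively, and let $T_{\mathrm{adv}}(x)$ be a user's reported trust on trial $x$ under an adversarial explanation:

```latex
% Notation is illustrative, not the paper's own.
% T_adv(x): reported trust on trial x under an adversarial explanation.
\Delta_{\mathrm{adv}}
  = \mathbb{E}_{x \in \mathcal{C}}\!\left[ T_{\mathrm{adv}}(x) \right]
  - \mathbb{E}_{x \in \mathcal{I}}\!\left[ T_{\mathrm{adv}}(x) \right]
```

Read this way, a well-calibrated user shows a large positive gap (much more trust in correct outputs than incorrect ones), while a successful AEA drives the gap toward zero.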
📝 Abstract
Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), in which an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. This gap quantifies the threat in which persuasive explanations reinforce users' trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite accompanying incorrect outputs. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, a neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
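To make the metric concrete, below is a minimal sketch of how a trust miscalibration gap could be computed from per-trial trust ratings. The function name, column names, and 1-7 rating scale are illustrative assumptions, not the paper's actual analysis pipeline:

```python
import pandas as pd

def trust_miscalibration_gap(trials: pd.DataFrame, condition: str) -> float:
    """Mean trust on correct outputs minus mean trust on incorrect outputs,
    restricted to one explanation condition ('benign' or 'adversarial').

    Assumed schema (hypothetical, not from the paper):
      condition      -- explanation condition shown on the trial
      output_correct -- bool, whether the model's answer was in fact correct
      trust          -- participant's reported trust (e.g., 1-7 Likert)
    """
    sub = trials[trials["condition"] == condition]
    return (
        sub.loc[sub["output_correct"], "trust"].mean()
        - sub.loc[~sub["output_correct"], "trust"].mean()
    )

# Toy data illustrating the failure mode: under adversarial framing,
# trust on wrong answers stays almost as high as on right ones.
df = pd.DataFrame({
    "condition":      ["benign"] * 4 + ["adversarial"] * 4,
    "output_correct": [True, True, False, False] * 2,
    "trust":          [7, 6, 2, 3,   6, 6, 5, 6],
})

print(trust_miscalibration_gap(df, "benign"))       # 4.0 (well calibrated)
print(trust_miscalibration_gap(df, "adversarial"))  # 0.5 (gap collapses)
```

Under this reading, the paper's central finding corresponds to the adversarial-condition gap shrinking toward zero relative to the benign condition.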