🤖 AI Summary
This work addresses the critical robustness gap of speech large language models (Speech LLMs) against adversarial speech inputs. We introduce the first “gaslighting attack” framework, systematically characterizing five manipulative speech prompting strategies—anger induction, cognitive interference, irony, implicit misdirection, and expert negation—and integrating them with acoustic perturbations for multimodal robustness evaluation. Experiments across five state-of-the-art speech and multimodal LLMs on over 10,000 samples demonstrate that the attack reduces average accuracy by 24.3%, revealing profound vulnerabilities in reasoning consistency and behavioral stability. Our study bridges a key gap in adversarial research for speech interfaces and establishes the first cognitive-layer manipulation assessment paradigm tailored to Speech LLMs. It provides both theoretical foundations and empirical benchmarks essential for developing trustworthy speech AI systems.
📝 Abstract
As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.