Benchmarking Gaslighting Attacks Against Speech Large Language Models

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

career value

198K/year

🤖 AI Summary

This work addresses the critical robustness gap of speech large language models (Speech LLMs) against adversarial speech inputs. We introduce the first “gaslighting attack” framework, systematically characterizing five manipulative speech prompting strategies—anger induction, cognitive interference, irony, implicit misdirection, and expert negation—and integrating them with acoustic perturbations for multimodal robustness evaluation. Experiments across five state-of-the-art speech and multimodal LLMs on over 10,000 samples demonstrate that the attack reduces average accuracy by 24.3%, revealing profound vulnerabilities in reasoning consistency and behavioral stability. Our study bridges a key gap in adversarial research for speech interfaces and establishes the first cognitive-layer manipulation assessment paradigm tailored to Speech LLMs. It provides both theoretical foundations and empirical benchmarks essential for developing trustworthy speech AI systems.

Technology Category

Application Category

📝 Abstract

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

Problem

Research questions and friction points this paper is trying to address.

Evaluating Speech LLM vulnerability to manipulative gaslighting attacks

Assessing model robustness against strategic cognitive disruption prompts

Measuring performance degradation and behavioral responses in speech AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing gaslighting attacks to test Speech LLMs

Using five manipulation strategies to assess robustness

Evaluating models with acoustic perturbations for multi-modal analysis

🔎 Similar Papers

LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation