Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech-driven multimodal large language models (e.g., SpeechGPT) exhibit alignment vulnerabilities unique to the speech modality—stemming from temporal characteristics of speech, phonetic variability, and ASR uncertainty—enabling adversaries to bypass safety guardrails. Method: We propose the first white-box adversarial attack framework targeting speech tokenizers: by reverse-engineering the speech tokenizer, we perform token-level perturbation optimization directly in the speech embedding space to synthesize playable adversarial audio—eliminating reliance on black-box TTS or manual construction. Contribution/Results: Our method achieves an 89% attack success rate on SpeechGPT, substantially outperforming existing speech jailbreaking approaches. It is the first to expose structural weaknesses in speech-modal alignment mechanisms, offering a novel paradigm for security evaluation and robust training of multimodal foundation models.
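
The summary above describes optimizing a perturbation so that the model's speech tokenizer emits an attacker-chosen unit sequence. Below is a minimal, illustrative PyTorch sketch of that general idea, not the authors' implementation: `SpeechEncoder` stands in for a differentiable HuBERT-style unit extractor, and the codebook size, target sequence, step count, and perturbation budget are all placeholder assumptions.

```python
# Illustrative sketch (not the paper's exact objective): optimize an
# audio perturbation so a speech tokenizer emits target adversarial units.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Placeholder for a differentiable speech encoder that maps a raw
    waveform to a sequence of frame embeddings (HuBERT-style)."""
    def __init__(self, dim=256, hop=320):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=640, stride=hop, padding=320)

    def forward(self, wav):                       # wav: (B, samples)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (B, frames, dim)

encoder = SpeechEncoder()
codebook = torch.randn(1000, 256)                 # K=1000 discrete speech units
target_tokens = torch.randint(0, 1000, (1, 50))   # attacker-chosen unit sequence

wav = torch.randn(1, 16000) * 0.1                 # 1 s carrier audio @ 16 kHz
delta = torch.zeros_like(wav, requires_grad=True) # additive perturbation
opt = torch.optim.Adam([delta], lr=1e-3)
eps = 0.01                                        # L_inf imperceptibility budget

for step in range(500):
    adv = torch.clamp(wav + delta, -1.0, 1.0)
    emb = encoder(adv)                                       # (1, frames, dim)
    # Soft quantization: logits are negative squared distances to the
    # codebook, so the nearest-unit assignment becomes differentiable.
    logits = -torch.cdist(emb, codebook.unsqueeze(0)) ** 2   # (1, frames, K)
    frames = logits.shape[1]
    # Stretch the target sequence to the encoder's frame rate.
    tgt = F.interpolate(target_tokens.float().unsqueeze(1),
                        size=frames).long().squeeze(1)       # (1, frames)
    loss = F.cross_entropy(logits.squeeze(0), tgt.squeeze(0))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                   # project back into the budget
```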

📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to an 89% attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.
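
The reported 89% attack success rate implies an automated success judge. A common heuristic in jailbreak evaluations is to count a response as a success when it contains no refusal phrasing; the sketch below follows that convention, with `speech_gpt` as a hypothetical callable (waveform in, text response out). The paper's exact judging protocol is not given in this digest.

```python
# Minimal evaluation sketch using the refusal-keyword heuristic.
# `speech_gpt` is a hypothetical wrapper around the victim model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def attack_success_rate(adv_audios, speech_gpt):
    """Fraction of adversarial audio prompts that elicit a non-refusal."""
    successes = sum(
        not any(m in speech_gpt(wav).lower() for m in REFUSAL_MARKERS)
        for wav in adv_audios
    )
    return successes / len(adv_audios)
```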
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in SpeechGPT via audio jailbreak attacks
Investigating underexplored voice modality security risks in MLLMs
Developing token-level attacks to bypass alignment safeguards
Innovation

Methods, ideas, or system contributions that make the work stand out.

White-box adversarial attack on speech input
Token-level attack exploiting speech tokenization
Synthesizing adversarial audio prompts effectively
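
The abstract states that optimized token sequences are synthesized into playable audio prompts. A natural mechanism for this is a unit-to-waveform vocoder of the kind SpeechGPT uses for discrete speech units; the sketch below assumes such a `vocoder` callable (a hypothetical stand-in, not the authors' released code) mapping a `(1, T)` tensor of unit IDs to a `(1, samples)` waveform.

```python
# Hedged sketch of the synthesis step: rendering optimized adversarial
# speech units into a playable WAV via an assumed unit vocoder.
import torch
import torchaudio

def tokens_to_audio(tokens, vocoder, sample_rate=16000, out_path="adv.wav"):
    """Render an adversarial unit sequence as playable audio."""
    units = torch.as_tensor(tokens, dtype=torch.long).unsqueeze(0)  # (1, T)
    with torch.no_grad():
        wav = vocoder(units)                                        # (1, samples)
    torchaudio.save(out_path, wav.cpu(), sample_rate)
    return wav.squeeze(0)
```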
Binhao Ma
Department of Computer Science, University of Missouri-Kansas City, Kansas City, United States
Hanqing Guo
Department of Electrical and Computer Engineering, University of Hawai’i at Mānoa, Honolulu, United States
Zhengping Jay Luo
Department of Computer Science and Physics, Rider University, Lawrenceville, United States
Rui Duan
Harvard University