🤖 AI Summary
Current safety mechanisms for large audio language models are primarily designed for text and struggle to defend against voice-based adversarial attacks. This work proposes an audio narrative jailbreaking attack that leverages instruction-following text-to-speech (TTS) models to embed harmful instructions within synthetically generated speech featuring narrative structure, while manipulating acoustic features to bypass safety filters. The method achieves a 98.26% attack success rate on state-of-the-art models such as Gemini 2.0 Flash—the highest reported to date for audio-language model jailbreaking—significantly outperforming text-only baselines. These results expose critical vulnerabilities in existing safety alignment protocols and underscore the necessity of jointly modeling linguistic and paralinguistic information to develop more robust multimodal safety frameworks.
📝 Abstract
Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.