Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

📅 2026-01-30

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

Current safety mechanisms for large audio language models are primarily designed for text and struggle to defend against voice-based adversarial attacks. This work proposes an audio narrative jailbreaking attack that leverages instruction-following text-to-speech (TTS) models to embed harmful instructions within synthetically generated speech featuring narrative structure, while manipulating acoustic features to bypass safety filters. The method achieves a 98.26% attack success rate on state-of-the-art models such as Gemini 2.0 Flash—the highest reported to date for audio-language model jailbreaking—significantly outperforming text-only baselines. These results expose critical vulnerabilities in existing safety alignment protocols and underscore the necessity of jointly modeling linguistic and paralinguistic information to develop more robust multimodal safety frameworks.

Technology Category

Application Category

📝 Abstract

Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.

Problem

Research questions and friction points this paper is trying to address.

audio-language models

jailbreak attacks

speech security

narrative audio

safety mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio narrative attacks

text-to-speech jailbreak

large audio-language models