🤖 AI Summary
This work identifies a content-centric misuse threat against large text-to-speech (TTS) models: adversaries can bypass safety alignment and input/output filtering to generate high-fidelity speech containing harmful content. Leveraging a characteristic of large audio-language models (LALMs), which refuse harmful prompts yet faithfully vocalize provided text, the authors propose HARMGEN, the first systematic attack suite to explore two cross-modal misuse pathways: semantic obfuscation (e.g., concatenation, token shuffling) and audio-channel injection (e.g., phoneme manipulation, spelling-out), comprising five novel attack methods in total. Evaluated across five commercial TTS systems, HARMGEN substantially reduces refusal rates and increases the toxicity of generated speech. Experiments show that existing proactive defenses detect only 57-93% of these attacks, while deepfake detectors exhibit poor robustness against high-fidelity adversarial audio. This work uncovers a new dimension of content-security risk in multimodal foundation models.
📄 Abstract
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while keeping the textual prompt benign. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; and proactive moderation detects only 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
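The abstract does not give implementation details for the Shuffle attack, but the core idea of semantic obfuscation can be sketched as a seeded, invertible token permutation: a content filter that matches contiguous phrases sees scrambled text, while the original ordering remains recoverable from the permutation. The function names below are illustrative, not from the paper, and a benign sentence stands in for any payload.

```python
import random

def shuffle_obfuscate(text: str, seed: int = 42):
    """Permute word order with a seeded RNG so the permutation
    is reproducible and therefore invertible."""
    words = text.split()
    idx = list(range(len(words)))
    random.Random(seed).shuffle(idx)
    return " ".join(words[i] for i in idx), idx

def deobfuscate(shuffled: str, idx: list[int]) -> str:
    """Invert the permutation: word at output position p came
    from original position idx[p]."""
    words = shuffled.split()
    original = [""] * len(words)
    for pos, i in enumerate(idx):
        original[i] = words[pos]
    return " ".join(original)

msg = "the quick brown fox jumps over the lazy dog"
obf, perm = shuffle_obfuscate(msg)
# A naive contiguous-phrase filter inspects `obf`, not `msg`,
# yet the original text is fully recoverable:
assert deobfuscate(obf, perm) == msg
```

This illustrates only why phrase-level text filters are a weak defense here; the paper's actual attacks additionally rely on the TTS model's willingness to faithfully vocalize whatever text it is given.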