🤖 AI Summary
To counter voice-cloning threats from modern speech synthesis, this paper proposes E2E-VGuard, a proactive defense framework that targets two scenarios: production large language model (LLM)-based synthesizers, and end-to-end (E2E) pipelines that use automatic speech recognition (ASR) to generate transcripts instead of relying on manual annotation. The method protects timbre with an encoder ensemble combined with a feature extractor, disrupts pronunciation with ASR-targeted adversarial examples, and applies a psychoacoustic model to keep the perturbation imperceptible. Its key contribution is addressing the ASR transcription stage of E2E pipelines, which earlier defenses for fine-tuned synthesizers did not cover because they assumed manually annotated transcripts. The framework is evaluated on 16 open-source synthesizers and 3 commercial APIs with English and Chinese datasets, where it is reported to be effective for both timbre and pronunciation protection; validation in a real-world deployment setting is also conducted.
📝 Abstract
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle against production large language model (LLM)-based speech synthesis. While previous studies have considered protection in the synthesizer fine-tuning setting, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that leverage automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. E2E speech synthesis therefore also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to keep the perturbation imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
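As a rough illustration of the timbre-protection idea in the abstract, the following toy sketch pushes a bounded perturbation away from the clean audio's embedding under an ensemble of encoders. It is not the authors' implementation. Everything in it is an assumption made for illustration: the "encoders" are random linear maps rather than real speaker encoders, the fixed per-sample bound `mask` is only a stand-in for a psychoacoustic masking threshold, and the function names (`embed`, `ensemble_distance`, `ensemble_gradient`, `protect`) are hypothetical. The ASR-targeted pronunciation loss is omitted entirely.

```python
import random


def embed(weights, signal):
    """Toy linear 'speaker encoder' (assumption): one dot product per output dim."""
    return [sum(w * s for w, s in zip(row, signal)) for row in weights]


def ensemble_distance(encoders, clean, protected):
    """Squared embedding distance between clean and protected audio,
    averaged over all encoders in the ensemble."""
    total = 0.0
    for weights in encoders:
        a, b = embed(weights, clean), embed(weights, protected)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(encoders)


def ensemble_gradient(encoders, delta):
    """Gradient of the distance w.r.t. the perturbation delta.
    For a linear encoder W the distance is ||W delta||^2, so the
    gradient is 2 * W^T (W delta), averaged over the ensemble."""
    grad = [0.0] * len(delta)
    for weights in encoders:
        proj = embed(weights, delta)  # W @ delta
        for row, p in zip(weights, proj):
            for i, w in enumerate(row):
                grad[i] += 2.0 * p * w
    return [g / len(encoders) for g in grad]


def protect(clean, encoders, mask, step=0.01, iters=50, seed=0):
    """Signed gradient ascent on the ensemble distance, clipping the
    perturbation to +/- mask[i] per sample (stand-in for a masking threshold)."""
    rng = random.Random(seed)
    # Start from small random noise: the gradient is exactly zero at delta = 0.
    delta = [0.1 * rng.uniform(-m, m) for m in mask]
    for _ in range(iters):
        grad = ensemble_gradient(encoders, delta)
        stepped = [d + step * ((g > 0) - (g < 0)) for d, g in zip(delta, grad)]
        delta = [max(-m, min(m, d)) for d, m in zip(stepped, mask)]
    return [c + d for c, d in zip(clean, delta)]
```

Averaging the objective over several encoders, rather than attacking a single one, is the usual motivation for an ensemble: the hope is that the perturbation transfers to encoders that were not seen during optimization. A real system would replace the fixed `mask` with a frequency-dependent threshold derived from a psychoacoustic model and add a term steering an ASR model toward an incorrect transcript.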