🤖 AI Summary
Zero-shot voice cloning poses severe privacy risks: it can replicate a speaker's identity with high fidelity from just a few seconds of reference audio. To address this, we propose CloneShield, the first general-purpose time-domain adversarial defense framework for zero-shot text-to-speech (TTS). Our approach requires no prior knowledge of the text to be synthesized and provides protection that is robust across speakers and utterances. We formulate perturbation generation as a multi-objective optimization problem that jointly preserves speech quality and suppresses cloning, and solve it via the Multiple Gradient Descent Algorithm (MGDA). Furthermore, we introduce Mel-spectrogram-guided decomposition of the adversarial perturbation and sample-level fine-tuning to inject imperceptible yet highly effective perturbations in the time domain. Extensive evaluation across three state-of-the-art zero-shot TTS systems and five benchmark datasets shows that CloneShield keeps the protected audio at near-lossless quality (PESQ = 3.90) while reducing the PESQ of cloned output to 1.07 and its speaker recognition similarity (SRS) to only 0.08. A 60-participant subjective study confirms that it combines high fidelity with strong anti-cloning capability.
📝 Abstract
Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns: a speaker's vocal identity can now be replicated with high accuracy from just a few seconds of reference audio while retaining the speaker's vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem and apply the Multiple Gradient Descent Algorithm (MGDA) to ensure robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems and five benchmark datasets, together with evaluations from 60 human listeners, demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
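To make the multi-objective step concrete, the following is a minimal sketch of the standard two-objective MGDA update, not CloneShield's actual implementation. It assumes each objective (for example, the losses on two different utterances) supplies a gradient with respect to the perturbation, and it uses the well-known closed-form solution for the minimum-norm convex combination of two gradients. The function name `mgda_two_task_direction` and the toy gradients are hypothetical and chosen only for illustration.

```python
import numpy as np


def mgda_two_task_direction(g1, g2):
    """Return the min-norm convex combination of two gradients and its weight.

    Solves min over gamma in [0, 1] of ||gamma * g1 + (1 - gamma) * g2||^2,
    which is the two-objective special case of MGDA. This is an illustrative
    helper, not code from the paper.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: any convex combination gives the same vector.
        return g1.copy(), 1.0
    # Unconstrained minimizer of the quadratic in gamma, clipped to [0, 1].
    gamma = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2, gamma


# Toy example: two conflicting (orthogonal) objective gradients.
direction, weight = mgda_two_task_direction([1.0, 0.0], [0.0, 1.0])
print(direction, weight)
```

When the returned direction is nonzero, it has a non-negative inner product with both gradients, so a small step against it does not increase either objective to first order. With more than two objectives, the weights come from a small quadratic program over the simplex rather than this closed form.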