CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

📅 2025-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zero-shot voice cloning poses severe privacy risks because it can replicate a speaker's identity with high fidelity from just a few seconds of reference audio. To address this, the authors propose CloneShield, which they describe as the first general-purpose time-domain adversarial defense framework for zero-shot text-to-speech (TTS). The approach requires no prior knowledge of the target text and provides robust protection across speakers and utterances. Perturbation generation is formulated as a multi-objective optimization problem that jointly preserves speech quality and suppresses cloning, and it is solved with the Multiple Gradient Descent Algorithm (MGDA). In addition, Mel-spectrogram-guided decomposition of the adversarial perturbation and sample-level fine-tuning are used to inject imperceptible yet highly effective perturbations in the time domain. Evaluation across three state-of-the-art zero-shot TTS systems and five benchmark datasets shows that protected audio retains near-lossless speech quality (PESQ = 3.90), while cloned output drops to PESQ = 1.07 and speaker recognition similarity (SRS) = 0.08. A subjective study with 60 participants confirms both the high fidelity of protected audio and the strong anti-cloning effect.

📝 Abstract

Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns: a speaker's vocal identity can be replicated with high accuracy and authenticity from just a few seconds of reference audio. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem and solve it with the Multiple Gradient Descent Algorithm (MGDA) to ensure robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems and five benchmark datasets, together with evaluations from 60 human listeners, demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
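The abstract does not spell out how MGDA combines the competing objectives. As a rough illustration only, the sketch below shows the standard two-objective case of MGDA, where the weight that minimizes the norm of the combined gradient has a closed form. The function names and the use of plain Python lists are assumptions made for this example; the paper's actual objectives, number of objectives, and solver may differ.

```python
# Illustrative sketch of two-objective MGDA (not the paper's implementation).
# Finds alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2,
# then returns the resulting common descent direction.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mgda_two_objective_weight(g1, g2):
    """Closed-form min-norm weight for two gradients (hypothetical helper)."""
    diff = [x - y for x, y in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every weight gives the same direction.
        return 0.5
    # Unconstrained minimizer: ((g2 - g1) . g2) / ||g1 - g2||^2
    alpha = dot([y - x for x, y in zip(g1, g2)], g2) / denom
    # Project onto [0, 1] so the result stays in the convex hull.
    return min(1.0, max(0.0, alpha))


def common_descent_direction(g1, g2):
    """Min-norm convex combination of the two gradients."""
    alpha = mgda_two_objective_weight(g1, g2)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(g1, g2)]


if __name__ == "__main__":
    # Orthogonal gradients of equal length are weighted equally.
    print(common_descent_direction([1.0, 0.0], [0.0, 1.0]))
```

Stepping the perturbation against this direction does not increase either objective to first order, which is why MGDA is a natural fit when quality preservation and cloning suppression pull in different directions.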

Problem

Research questions and friction points this paper is trying to address.

Defend against zero-shot voice cloning attacks

Protect vocal identity without prior text knowledge

Ensure robust perturbation across diverse utterances

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal time-domain adversarial perturbation framework

Multiple Gradient Descent Algorithm (MGDA) for robust protection

Mel-spectrogram decomposition for imperceptible perturbation
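To make "universal time-domain perturbation" concrete, here is a minimal sketch of how one precomputed perturbation could be added to a waveform of any length. Everything in it is an assumption for illustration: the tiling of the perturbation, the amplitude budget `epsilon`, the [-1, 1] sample range, and the function name are not taken from the paper, and the Mel-spectrogram-guided decomposition and per-sample fine-tuning are not modeled.

```python
# Illustrative sketch only: applying a single (universal) perturbation
# to an arbitrary-length waveform in the time domain.

def apply_universal_perturbation(waveform, delta, epsilon):
    """Add a tiled, amplitude-bounded perturbation to a waveform.

    waveform: list of float samples, assumed to lie in [-1, 1]
    delta:    list of float perturbation samples (hypothetical, precomputed)
    epsilon:  assumed per-sample amplitude bound on the perturbation
    """
    protected = []
    for i, sample in enumerate(waveform):
        # Tile the perturbation so it covers utterances of any length.
        d = delta[i % len(delta)]
        # Keep the perturbation within the assumed amplitude budget.
        d = max(-epsilon, min(epsilon, d))
        # Keep the protected sample within the valid audio range.
        protected.append(max(-1.0, min(1.0, sample + d)))
    return protected


if __name__ == "__main__":
    audio = [0.0, 0.5, -0.5, 0.99]
    delta = [0.01, -0.05]
    print(apply_universal_perturbation(audio, delta, epsilon=0.02))
```

Because the same `delta` is reused for every input, nothing needs to be optimized per utterance at protection time in this simplified picture; the sample-level fine-tuning the paper describes would be an additional step on top.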
