VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To mitigate illicit voice cloning enabled by diffusion models, this paper proposes VoiceCloak, a multi-dimensional proactive defense framework. VoiceCloak injects lightweight adversarial perturbations into reference audio—uniquely integrating auditory-perception-guided identity embedding perturbation, attention-context disruption, score-magnitude amplification, and semantic-level noise corruption within the diffusion denoising process. This joint strategy simultaneously renders speaker identity unrecognizable and controllably degrades synthesis quality. Unlike passive detection or model-specific modifications, VoiceCloak operates in a black-box setting without requiring access to the target cloning system’s internal parameters. Evaluated on mainstream diffusion-based voice cloning systems, VoiceCloak achieves over 92% defense success rate, significantly reducing both naturalness and speaker identifiability of cloned speech while preserving the usability and fidelity of the original reference audio.

📝 Abstract
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), but they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, yet they have proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework that obfuscates speaker identity and degrades perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak distorts representation-learning embeddings to maximize identity variation, guided by auditory perception principles. It additionally disrupts crucial conditional guidance processes, particularly attention context, preventing the alignment of vocal characteristics essential for convincing cloning. To address the second objective, VoiceCloak introduces score-magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.
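The identity-obfuscation step described above — distorting speaker-identity embeddings with a bounded adversarial perturbation on the reference audio — can be sketched as a PGD-style attack. This is a minimal toy illustration, not the paper's implementation: a fixed random linear map stands in for the real speaker encoder, and `pgd_identity_perturbation`, the step sizes, and all shapes are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for a frozen speaker encoder (the real attack would
# target a learned speaker-verification network, not a linear map).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))  # maps 64 audio samples -> 8-dim embedding

def speaker_embedding(x):
    return W @ x

def pgd_identity_perturbation(x_ref, eps=0.01, alpha=0.002, steps=20):
    """PGD-style ascent: push the embedding of the perturbed audio away
    from the clean embedding, keeping the perturbation within an
    L-infinity budget eps so the reference audio stays usable."""
    e_clean = speaker_embedding(x_ref)
    delta = rng.uniform(-eps, eps, size=x_ref.shape)  # random start
    for _ in range(steps):
        diff = speaker_embedding(x_ref + delta) - e_clean
        # gradient of 0.5 * ||W(x+d) - Wx||^2 w.r.t. d is W^T (W d)
        grad = W.T @ diff
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta

x = rng.standard_normal(64)        # stand-in reference waveform
d = pgd_identity_perturbation(x)   # lightweight adversarial perturbation
```

The clipping keeps the perturbation imperceptibly small (the "lightweight" property claimed in the summary) while the ascent maximizes identity variation in embedding space.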
Problem

Research questions and friction points this paper is trying to address.

Prevent unauthorized voice cloning using diffusion models
Obfuscate speaker identity in cloned audio
Degrade perceptual quality of maliciously cloned speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces adversarial perturbations to disrupt cloning
Targets speaker identity via distorted embeddings
Amplifies score magnitude to degrade speech quality
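The score-magnitude amplification idea — steering the diffusion reverse trajectory away from high-quality outputs by inflating the score term — can be illustrated with a 1-D toy. Here a Gaussian score function stands in for the model's score network; the `gamma` amplification factor and step sizes are hypothetical, chosen only to show that an amplified score makes the reverse updates overshoot instead of settling near the data mode.

```python
import numpy as np

MU, SIGMA2 = 0.0, 1.0  # toy data distribution N(MU, SIGMA2)

def score(x):
    # Score of N(MU, SIGMA2): d/dx log p(x) = -(x - MU) / SIGMA2
    return -(x - MU) / SIGMA2

def reverse_steps(x0, gamma=1.0, step=0.1, n=20):
    """Deterministic reverse-style updates x <- x + step * gamma * score(x).
    gamma = 1 follows the learned trajectory toward the mode;
    gamma >> 1 amplifies the score magnitude so updates overshoot
    and the sample is driven away from high-density regions."""
    x = x0
    for _ in range(n):
        x = x + step * gamma * score(x)
    return x

clean = reverse_steps(5.0, gamma=1.0)     # converges toward the mode at 0
attacked = reverse_steps(5.0, gamma=25.0) # amplified score: diverges
```

With `gamma = 1` each update contracts toward the mode (factor 0.9 per step); with `gamma = 25` the effective update factor is -1.5, so the iterate grows in magnitude — a cartoon of how amplification degrades synthesis quality.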
Qianyue Hu
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Junyan Wu
Ph.D. student, School of Computer Science and Engineering, Sun Yat-sen University; multimedia forensics and security
Wei Lu
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Xiangyang Luo
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China