ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks

📅 2025-08-25
🤖 AI Summary
Deepfake speech attacks pose severe threats to real-time voice communication security. Existing noise-injection-based defenses degrade audio quality and rely on prior knowledge of the attack method, so they generalize poorly. This paper proposes ClearMask, a noise-free, naturalness-preserving deepfake defense framework. It disrupts speech encoder representations without introducing perceptible noise by combining selective filtering of the mel-spectrogram, audio style transfer, and optimized reverberation. A lightweight companion module, LiveMask, protects streaming speech in real time. ClearMask is presented as the first method to achieve high-fidelity, imperceptible, and robust protection: it generalizes across white-box and black-box settings, covers both known and unknown text-to-speech (TTS) models, and resists adaptive recovery attacks. Experiments report an average defense success rate exceeding 92% across major TTS systems while preserving speech quality with a PESQ score above 4.2.

📝 Abstract
Voice deepfake attacks, which artificially impersonate human speech for malicious purposes, have emerged as a severe threat. Existing defenses typically inject noise into human speech to compromise voice encoders in speech synthesis models. However, these methods degrade audio quality and require prior knowledge of the attack approaches, limiting their effectiveness in diverse scenarios. Moreover, real-time audio, such as speech in virtual meetings and voice messages, is still exposed to voice deepfake threats. To overcome these limitations, we propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies, inducing a transferable voice feature loss without injecting noise. We then apply audio style transfer to further deceive voice decoders while preserving perceived sound quality. Finally, optimized reverberation is introduced to disrupt the output of voice generation models without affecting the naturalness of the speech. Additionally, we develop LiveMask to protect streaming speech in real time through a universal frequency filter and reverberation generator. Our experimental results show that ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving speaker verification models and human listeners, even for unseen voice synthesis models and black-box API services. Furthermore, ClearMask demonstrates resilience against adaptive attackers who attempt to recover the original audio signal from the protected speech samples.
Problem

Research questions and friction points this paper is trying to address.

Protecting voice from deepfake attacks without noise injection
Preserving audio naturalness while disrupting voice synthesis models
Enabling real-time protection for streaming speech applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective frequency filtering without noise injection
Audio style transfer to deceive voice decoders
Optimized reverberation disrupts generation while preserving naturalness
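
As a rough illustration of two of the ideas listed above, the sketch below removes a narrow frequency band from a short audio frame and then adds reverberation by convolving with an impulse response. It is not ClearMask's implementation: the paper filters the mel-spectrogram and optimizes both the filter and the reverberation, whereas this sketch assumes a plain DFT band-stop filter, a hand-picked band (1800 to 2200 Hz), and a hypothetical three-tap impulse response. The style-transfer stage is omitted, and all function names are made up for the example.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a short demo frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def band_stop(frame, sample_rate, lo_hz, hi_hz, gain=0.0):
    """Scale every frequency bin inside [lo_hz, hi_hz] by `gain` (0.0 removes it)."""
    n = len(frame)
    spec = dft(frame)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        if lo_hz <= freq <= hi_hz:
            spec[k] *= gain
    return idft(spec)


def add_reverb(signal, impulse_response):
    """Convolve the signal with an impulse response (direct-form convolution)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def tone_energy(signal, sample_rate, freq_hz):
    """Magnitude of the signal's correlation with a single complex tone."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq_hz * t / sample_rate)
                   for t, s in enumerate(signal)))


# Demo frame: a 500 Hz tone plus a 2000 Hz tone, 128 samples at 8 kHz.
SR, N = 8000, 128
frame = [math.sin(2 * math.pi * 500 * t / SR) + math.sin(2 * math.pi * 2000 * t / SR)
         for t in range(N)]

# Assumed band to suppress; ClearMask selects its frequencies differently.
filtered = band_stop(frame, SR, lo_hz=1800, hi_hz=2200)

# Hypothetical impulse response: direct path plus two decaying echoes.
ir = [1.0] + [0.0] * 15 + [0.4] + [0.0] * 15 + [0.16]
protected = add_reverb(filtered, ir)
```

Calling `tone_energy(filtered, SR, 2000)` should return a value near zero while `tone_energy(filtered, SR, 500)` stays unchanged, which shows the filter removing one band without adding any noise to the rest of the signal. A streaming variant in the spirit of LiveMask would apply the same two steps frame by frame and carry the reverberation tail over into the next frame.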
Authors

Yuanda Wang
Michigan State University
Bocheng Chen
Michigan State University
Hanqing Guo
University of Hawaii at Mānoa
Guangjing Wang
University of South Florida
Weikang Ding
Michigan State University
Qiben Yan
Computer Science and Engineering, Michigan State University
Security and Privacy · Cyber-Physical Systems · AI Agent · Internet-of-Things · Smart Contract