🤖 AI Summary
Unauthorized voice cloning in text-to-speech (TTS) synthesis raises serious privacy and security risks, yet existing audio watermarking methods struggle to balance robustness against computational efficiency. This paper proposes a lightweight deep learning-based speech watermarking framework built on progressive knowledge distillation: a reversible neural network serves as a highly robust teacher model, whose watermarking capability is systematically transferred through multi-stage knowledge distillation to a compact student model, thereby combining the computational efficiency of digital signal processing (DSP) with the strong robustness of deep learning. Experiments demonstrate a 93.6% reduction in computational cost, an average detection F1-score of 99.6% under diverse distortion attacks (e.g., compression, noise, resampling), and a PESQ score of 4.30, confirming high imperceptibility and real-time generation capability. The framework delivers an efficient, practical, and secure watermarking solution for controllable TTS systems.
📝 Abstract
With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve computational efficiency while enhancing robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher's capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining a high level of robustness and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 under advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
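The teacher-student transfer described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of output-level knowledge distillation for a watermark embedder, not the paper's actual architecture: the layer sizes, loss weight, module names, and the use of plain MLPs in place of the invertible teacher network are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # deterministic illustration

class TeacherEmbedder(nn.Module):
    """Stand-in for the large, robust teacher watermark embedder
    (the paper uses an invertible neural network; this MLP is a placeholder)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, audio_feat, msg):
        # Residual embedding: watermarked features = features + perturbation
        return audio_feat + self.net(audio_feat + msg)

class StudentEmbedder(nn.Module):
    """Compact student with far fewer parameters than the teacher."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, audio_feat, msg):
        return audio_feat + self.net(audio_feat + msg)

def distillation_step(teacher, student, audio_feat, msg, opt, alpha=1.0):
    """One distillation step: pull the student's watermarked output
    toward the frozen teacher's output (output-level matching)."""
    with torch.no_grad():
        target = teacher(audio_feat, msg)     # teacher is frozen
    pred = student(audio_feat, msg)
    loss = alpha * nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

teacher = TeacherEmbedder().eval()
student = StudentEmbedder()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

feat = torch.randn(8, 128)   # batch of audio feature frames (illustrative)
msg = torch.randn(8, 128)    # projected watermark message (illustrative)
losses = [distillation_step(teacher, student, feat, msg, opt) for _ in range(50)]
```

In the paper this idea is applied progressively over multiple stages rather than in a single pass, and the real loss additionally has to preserve imperceptibility (PESQ) and detection accuracy under distortion attacks; the sketch only conveys the core teacher-to-student transfer.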