Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation

📅 2025-09-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing audio watermarking methods struggle to balance robustness and computational efficiency when addressing the privacy and security risks of unauthorized voice cloning in text-to-speech (TTS) synthesis. This paper proposes a lightweight deep learning-based speech watermarking framework built on progressive knowledge distillation: an invertible neural network serves as a highly robust teacher model, whose watermarking capability is transferred via multi-stage knowledge distillation to a compact student model, combining the computational efficiency of digital signal processing (DSP) with the strong robustness of deep learning. Experiments demonstrate a 93.6% reduction in computational cost, an average detection F1 score of 99.6% under diverse distortion attacks (e.g., compression, noise, resampling), and a PESQ score of 4.30, confirming high imperceptibility and real-time generation capability. The framework delivers an efficient, practical, and secure watermarking solution for controllable TTS systems.

๐Ÿ“ Abstract
With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve computational efficiency while enhancing robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher's capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining a high level of robustness and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 under advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
Problem

Research questions and friction points this paper is trying to address.

Unauthorized voice cloning poses privacy and security risks
Existing watermarking methods trade robustness for computational efficiency
Need lightweight deep learning-based watermarking for real-time synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive knowledge distillation for lightweight watermarking
Invertible neural network teacher model training
Compact student model with 93.6% cost reduction
Yang Cui (Microsoft, Beijing, China)
Peter Pan (Microsoft, Beijing, China)
Lei He (Microsoft, Beijing, China)
Sheng Zhao (Microsoft)