🤖 AI Summary
Unauthorized voice cloning in text-to-speech (TTS) synthesis raises serious privacy and security risks, yet existing audio watermarking methods struggle to balance robustness against computational efficiency. This paper proposes a lightweight deep learning-based speech watermarking framework built on progressive knowledge distillation: a reversible neural network serves as a highly robust teacher model, whose watermarking capability is systematically transferred through multi-stage knowledge distillation to a compact student model, thereby combining the computational efficiency of digital signal processing (DSP) with the strong robustness of deep learning. Experiments demonstrate a 93.6% reduction in computational cost, an average detection F1-score of 99.6% under diverse distortion attacks (e.g., compression, noise, resampling), and a PESQ score of 4.30, confirming high imperceptibility and real-time generation capability. The framework delivers an efficient, practical, and secure watermarking solution for controllable TTS systems.
📝 Abstract
With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve computational efficiency while enhancing robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher's capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining a high level of robustness and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 under advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
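The teacher-student transfer described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of output-level knowledge distillation for a watermark embedder, not the paper's actual architecture: the layer sizes, loss weight, module names, and the use of plain MLPs in place of the invertible teacher network are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # deterministic illustration

class TeacherEmbedder(nn.Module):
    """Stand-in for the large, robust teacher watermark embedder
    (the paper uses an invertible neural network; this MLP is a placeholder)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, audio_feat, msg):
        # Residual embedding: watermarked features = features + perturbation
        return audio_feat + self.net(audio_feat + msg)

class StudentEmbedder(nn.Module):
    """Compact student with far fewer parameters than the teacher."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, audio_feat, msg):
        return audio_feat + self.net(audio_feat + msg)

def distillation_step(teacher, student, audio_feat, msg, opt, alpha=1.0):
    """One distillation step: pull the student's watermarked output
    toward the frozen teacher's output (output-level matching)."""
    with torch.no_grad():
        target = teacher(audio_feat, msg)     # teacher is frozen
    pred = student(audio_feat, msg)
    loss = alpha * nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

teacher = TeacherEmbedder().eval()
student = StudentEmbedder()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

feat = torch.randn(8, 128)   # batch of audio feature frames (illustrative)
msg = torch.randn(8, 128)    # projected watermark message (illustrative)
losses = [distillation_step(teacher, student, feat, msg, opt) for _ in range(50)]
```

In the paper this idea is applied progressively over multiple stages rather than in a single pass, and the real loss additionally has to preserve imperceptibility (PESQ) and detection accuracy under distortion attacks; the sketch only conveys the core teacher-to-student transfer.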