P2Mark: Plug-and-play Parameter-intrinsic Watermarking for Neural Speech Generation

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In open-source white-box settings, neural speech synthesis models are vulnerable to watermark removal, hindering reliable copyright attribution. To address this, we propose a parameter-intrinsic watermarking mechanism that embeds watermarks directly into trainable model parameters rather than into the output audio, tightly coupling the watermark with the weights and making it intrinsically hard to remove. Our approach employs differentiable adapters for end-to-end joint optimization of watermark embedding and model functionality. It is compatible with both vocoder- and codec-based decoders and supports cross-architecture deployment. Experiments demonstrate state-of-the-art performance in watermark detection accuracy, perceptual imperceptibility, and robustness against removal attacks. Notably, this is the first watermarking framework to provide reliable model-level copyright tracing and protection under open-source white-box conditions.

📝 Abstract
Recently, a large number of advanced neural speech generation methods have emerged in the open-source community. Although this has facilitated the application and development of the technology, it has also made it harder to prevent the abuse of generated speech and to protect copyrights. Audio watermarking is an effective method for proactively protecting generated speech, but when the source code and model weights of a neural speech generation method are open-sourced, audio watermarks produced by previous watermarking methods can be easily removed or manipulated. This paper proposes a Plug-and-play Parameter-intrinsic WaterMarking (P2Mark) method for protecting neural speech generation systems. The main advantage of P2Mark is that the watermark information is flexibly integrated into the neural speech generation model in the form of parameters, by training a watermark adapter, rather than being injected into the model in the form of features. Once the watermark adapter carrying the watermark embedding is merged with the pre-trained generation model, the watermark information cannot be easily removed or manipulated. P2Mark is therefore a reliable choice for proactively tracing and protecting the copyrights of neural speech generation models in open-source white-box scenarios. We validated P2Mark on the two main types of decoders used in neural speech generation: vocoders and codecs. Experimental results show that, in terms of watermark extraction accuracy, watermark imperceptibility, and robustness, P2Mark achieves performance comparable to state-of-the-art audio watermarking methods, even though those methods cannot be used in open-source white-box protection scenarios.
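To make the "merge the adapter into the parameters" idea concrete, here is a minimal numpy sketch. It assumes the watermark adapter is a low-rank weight update (LoRA-style); the paper's actual adapter architecture and training objective may differ. The point it illustrates is that, after merging, the released checkpoint contains a single weight matrix with no separable watermark component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for one linear layer of a speech decoder.
d_out, d_in, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))       # pre-trained generator weight
A = rng.standard_normal((rank, d_in)) * 0.1  # trained watermark adapter, down-projection
B = rng.standard_normal((d_out, rank)) * 0.1 # trained watermark adapter, up-projection

# Before release: fold the adapter into the base weight.
W_merged = W + B @ A

x = rng.standard_normal(d_in)
y_side = W @ x + B @ (A @ x)  # base model plus adapter side-branch
y_merged = W_merged @ x       # single merged weight matrix

# The merged model is functionally identical to base + adapter,
# but the watermark parameters are no longer a deletable module.
assert np.allclose(y_side, y_merged)
```

The merge is exact because matrix multiplication is associative: `(W + B @ A) @ x == W @ x + B @ (A @ x)` up to floating-point tolerance, so an attacker inspecting the released weights sees only `W_merged` and cannot simply strip the watermark branch.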
Problem

Research questions and friction points this paper is trying to address.

Prevents abuse of open-source neural speech generation models
Protects copyrights via parameter-intrinsic watermarking
Ensures watermark robustness in white-box scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play watermark adapter for neural speech
Parameter-intrinsic watermarking for model protection
Robust watermarking in open-source white-box scenarios
Yong Ren
Institute of Automation, Chinese Academy of Sciences
Speech Codec, Text-to-speech, Video-to-audio, MLLM, Continual Learning

Jiangyan Yi
Tsinghua University
Speech signal processing, speech synthesis, fake audio detection, continual learning

Tao Wang
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Jianhua Tao
Department of Automation, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China

Zhengqi Wen
Tsinghua University
LLM

Chenxing Li
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing, Sentiment Analysis, Machine Learning

Ruibo Fu
Associate Professor, CASIA
AIGC, LMM, Intelligent speech interaction, Deepfake detection

Ye Bai
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Xiaohui Zhang
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China