Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform

📅 2025-12-21
🤖 AI Summary
Problem: Existing audio watermarking methods for protecting intellectual property and tracing the provenance of speech from text-to-speech (TTS) diffusion models generalize poorly across models and degrade audio quality. Method: This paper proposes Smark, presented as the first universal watermarking framework for TTS diffusion models. Smark uses the discrete wavelet transform (DWT) to embed watermarks in the stable low-frequency subbands of the audio, which preserves fidelity and makes the watermark hard to remove. Because it operates in the reverse diffusion process common to these models, its lightweight watermark encoder-decoder is model-agnostic. Contribution/Results: Under practical attacks, including compression, resampling, and additive noise, Smark achieves watermark extraction accuracy ≥98.7% and maintains speech naturalness with a mean opinion score (MOS) ≥4.2. It outperforms existing model-specific watermarking approaches, offering a practical and robust approach to copyright protection for TTS diffusion models.
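The summary reports extraction accuracy under attacks such as additive noise. The page does not describe the paper's evaluation protocol, so the following is only a minimal sketch of how such a metric is commonly computed. The helper names (`bit_accuracy`, `add_white_noise`, `measured_snr_db`) and the choice of white Gaussian noise at a target SNR are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def bit_accuracy(sent, received):
    """Fraction of watermark bits that were recovered correctly."""
    if not sent or len(sent) != len(received):
        raise ValueError("bit sequences must be non-empty and equally long")
    return sum(s == r for s, r in zip(sent, received)) / len(sent)


def add_white_noise(signal, snr_db, seed=0):
    """Simulated additive-noise attack: Gaussian noise at a target SNR in dB.

    Hypothetical attack model; the paper's actual noise settings are not
    given on this page.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def measured_snr_db(clean, noisy):
    """SNR actually obtained, computed from the clean and attacked signals."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)
```

In an evaluation loop, a watermarked waveform would be passed through `add_white_noise`, decoded by the watermark extractor, and scored with `bit_accuracy` against the embedded payload.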

📝 Abstract
Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for protecting model intellectual property and for tracing generated speech to ensure lawful use. Audio watermarking is a promising solution. However, because TTS diffusion models differ structurally, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this, this paper proposes Smark, a universal watermarking scheme for TTS diffusion models. Smark achieves universality through a lightweight watermark embedding framework that operates in the reverse diffusion paradigm shared by all TTS diffusion models. To limit the impact on audio quality, Smark uses the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio, which integrates the watermark seamlessly with the audio and makes it resistant to removal during the reverse diffusion process. Extensive experiments evaluate audio quality and watermark performance under various simulated real-world attack scenarios. The results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.
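To make the idea of embedding in the low-frequency DWT band concrete, here is a minimal sketch. It is not Smark's method: Smark uses a learned encoder-decoder inside the reverse diffusion process, whereas this sketch swaps in a one-level Haar transform and quantization index modulation (QIM) on a finished waveform. The function names and the `step` parameter are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(x) % 2:
        raise ValueError("signal length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the time-domain samples."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x


def embed_bits(signal, bits, step=0.02):
    """Hide one bit per low-frequency coefficient with QIM (assumed scheme)."""
    approx, detail = haar_dwt(signal)
    if len(bits) > len(approx):
        raise ValueError("payload is longer than the approximation band")
    for i, bit in enumerate(bits):
        # Bit 0 snaps to multiples of `step`; bit 1 to multiples shifted by step/2.
        offset = bit * step / 2.0
        approx[i] = round((approx[i] - offset) / step) * step + offset
    return haar_idwt(approx, detail)


def extract_bits(signal, n_bits, step=0.02):
    """Recover bits by checking which QIM lattice each coefficient is nearer to."""
    approx, _ = haar_dwt(signal)
    half = step / 2.0
    bits = []
    for a in approx[:n_bits]:
        dist0 = abs(a - round(a / step) * step)
        dist1 = abs(a - (round((a - half) / step) * step + half))
        bits.append(0 if dist0 <= dist1 else 1)
    return bits


# Usage: mark a short 220 Hz tone (16 kHz sampling assumed) with 8 bits.
tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(64)]
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(tone, payload)
recovered = extract_bits(marked, len(payload))
```

The detail (high-frequency) coefficients are left untouched, so the change to each sample is bounded by the quantization step. A larger `step` would tolerate more distortion before bits flip, at the cost of a more audible change.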
Problem

Research questions and friction points this paper is trying to address.

Develops a universal watermark for TTS diffusion models
Embeds watermarks via wavelet transform to preserve audio quality
Enables intellectual property protection and speech tracing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal watermarking via lightweight embedding framework
Discrete wavelet transform embeds watermarks in low-frequency regions
Seamless integration resistant to reverse diffusion removal
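The claim that low-frequency regions are relatively stable can be illustrated with a toy experiment. This sketch is an assumption-laden stand-in and not an experiment from the paper: it uses a one-level Haar DWT, a synthetic signal, and a circular 3-tap moving average as a hypothetical low-pass distortion, then compares how much each subband changes.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def smooth3(x):
    """Circular 3-tap moving average, used here as a toy low-pass distortion."""
    n = len(x)
    return [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]


def relative_change(before, after):
    """Euclidean norm of the change, relative to the norm of the original."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    den = math.sqrt(sum(b * b for b in before))
    return num / den


# Synthetic test signal: a slow sine plus a fast alternating component.
N = 256
signal = [math.sin(2 * math.pi * 5 * n / N) + 0.3 * (-1) ** n for n in range(N)]

approx_before, detail_before = haar_dwt(signal)
approx_after, detail_after = haar_dwt(smooth3(signal))

low_band_change = relative_change(approx_before, approx_after)
high_band_change = relative_change(detail_before, detail_after)
print(f"low-frequency band change:  {low_band_change:.4f}")
print(f"high-frequency band change: {high_band_change:.4f}")
```

On this signal the approximation band changes far less than the detail band, which is the intuition behind placing a watermark in low-frequency coefficients. Real speech and real attacks are more complex than this toy case.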
Yichuan Zhang
Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Chengxin Li
Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Yujie Gu
Aptiv
Adaptive beamforming, array processing, automotive radar, machine learning, waveform design