MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

📅 2026-05-02

🤖 AI Summary
This work addresses the lack of effective copyright protection and provenance mechanisms for AI-generated speech by proposing a plug-and-play watermarking method tailored to Mel-conditioned text-to-speech (TTS) architectures. The approach embeds a low-energy spread-spectrum watermark, controlled by a secret key, into intermediate acoustic representations in the Mel-spectrogram domain. By leveraging time-frequency region selection and integrating in a vocoder-agnostic manner, the method enables on-the-fly watermark embedding during speech generation without requiring modifications or retraining of the vocoder. It supports multi-user key allocation and secure verification. Experiments on DiffWave and HiFi-GAN demonstrate near-perfect watermark extraction accuracy, strong robustness against compression and noise, and consistently high perceptual audio quality.
📝 Abstract
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions, e.g., compression and additive noise, while preserving high perceptual audio quality.
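The keyed spread-spectrum idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`embed`, `extract`), the fixed Mel-bin band used as the "selected region", the strength `alpha`, the SHA-256 seeding, and the mean-removed correlation detector are all assumptions made for illustration. The sketch operates directly on a Mel-spectrogram stored as a list of rows; in a real pipeline the detector would presumably recompute the Mel-spectrogram from the received waveform.

```python
import hashlib
import random


def _keyed_rng(key: bytes) -> random.Random:
    # Derive a deterministic PRNG from the secret key.
    # Assumption: illustrative only, not a vetted key-derivation scheme.
    return random.Random(int.from_bytes(hashlib.sha256(key).digest(), "big"))


def _plan(key, n_mels, n_frames, n_bits, band):
    """For each payload bit, return a list of ((mel_bin, frame), chip) pairs."""
    rng = _keyed_rng(key)
    lo, hi = band
    # Hypothetical region selection: all cells inside one Mel-bin band.
    cells = [(f, t) for f in range(lo, min(hi, n_mels)) for t in range(n_frames)]
    rng.shuffle(cells)  # keyed scattering across time and frequency
    plan = [[] for _ in range(n_bits)]
    for i, cell in enumerate(cells):
        # Each cell gets a keyed +/-1 chip and is assigned to one bit.
        plan[i % n_bits].append((cell, rng.choice((-1.0, 1.0))))
    return plan


def embed(mel, bits, key, alpha=0.05, band=(20, 60)):
    """Return a copy of `mel` with the payload added as +/-alpha perturbations."""
    out = [row[:] for row in mel]
    plan = _plan(key, len(mel), len(mel[0]), len(bits), band)
    for bit, chips in zip(bits, plan):
        sign = 1.0 if bit else -1.0
        for (f, t), chip in chips:
            out[f][t] += alpha * sign * chip
    return out


def extract(mel, n_bits, key, band=(20, 60)):
    """Recover bits by correlating mean-removed cell values with the keyed chips."""
    plan = _plan(key, len(mel), len(mel[0]), n_bits, band)
    bits = []
    for chips in plan:
        vals = [mel[f][t] for (f, t), _ in chips]
        mean = sum(vals) / len(vals)
        corr = sum((v - mean) * chip for v, (_, chip) in zip(vals, chips))
        bits.append(1 if corr > 0 else 0)
    return bits
```

Without the key, neither the cell assignment nor the chip signs can be regenerated, which is the sense in which decoding is keyed in this sketch. On a real log-Mel spectrogram the host content acts as interference in the correlation, so reliable blind extraction would likely need more chips per bit, a larger `alpha`, or a more careful detector than the one shown here.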
Problem

Research questions and friction points this paper is trying to address.

audio watermarking
AI-generated speech
provenance attribution
Mel-spectrogram
copyright protection
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio watermarking
Mel-spectrogram
TTS provenance
keyed spread-spectrum
in-generation embedding
Yutong Jin
Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada
Qi Li
Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada
Lingshuang Liu
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
Jianbing Ni
Queen's University
AI Safety and Security, Cloud-Edge Security, Mobile Network Security, Blockchain Technology