MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of effective copyright protection and provenance mechanisms for AI-generated speech by proposing a plug-and-play watermarking method for Mel-conditioned text-to-speech (TTS) architectures. The method embeds a low-energy spread-spectrum watermark, controlled by a secret key, into the intermediate Mel-spectrogram before waveform synthesis. Because the perturbation is confined to selected time-frequency regions and applied upstream of the vocoder, embedding happens on the fly during generation and requires no modification or retraining of the vocoder. The keyed construction supports multi-user key allocation and restricts verification to key holders. Experiments on DiffWave and HiFi-GAN report bit accuracy approaching 100% under compression and additive noise, with high perceptual audio quality preserved.

📝 Abstract

In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying vocoder, such as DiffWave or HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions such as compression and additive noise, while preserving high perceptual audio quality.
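
The keyed spread-spectrum idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`embed`, `extract`), the strength `alpha`, the fixed Mel-bin band used as the "selected region", and the SHA-256 key-to-seed derivation are all assumptions chosen for illustration. The sketch also correlates directly on the Mel array, whereas a real pipeline would pass the watermarked Mel-spectrogram through a vocoder and recompute it from the audio before extraction.

```python
import hashlib
import numpy as np


def _keyed_rng(key: bytes) -> np.random.Generator:
    # Assumption: derive a PRNG seed from the secret key via SHA-256.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return np.random.default_rng(seed)


def _carrier(key: bytes, n_frames: int, n_bits: int, band: tuple):
    # Keyed +/-1 chip sequence plus a keyed assignment of each
    # time-frequency cell in the band to one payload bit.
    rng = _keyed_rng(key)
    lo, hi = band
    n_cells = (hi - lo) * n_frames
    chips = rng.choice([-1.0, 1.0], size=n_cells)
    owner = rng.permutation(n_cells) % n_bits
    return chips, owner


def embed(mel, bits, key, alpha=0.2, band=(10, 60)):
    # Add a low-amplitude keyed perturbation to the selected Mel band.
    # alpha and band are illustrative values, not taken from the paper.
    lo, hi = band
    chips, owner = _carrier(key, mel.shape[1], len(bits), band)
    symbols = 2.0 * np.asarray(bits, dtype=float) - 1.0  # {0,1} -> {-1,+1}
    out = mel.astype(float).copy()
    out[lo:hi, :] += alpha * (symbols[owner] * chips).reshape(hi - lo, -1)
    return out


def extract(mel, n_bits, key, band=(10, 60)):
    # Blind detection: mean-center the band, correlate with the keyed
    # chips, and take the sign of each bit's accumulated correlation.
    lo, hi = band
    chips, owner = _carrier(key, mel.shape[1], n_bits, band)
    region = mel[lo:hi, :].astype(float).ravel()
    region = region - region.mean()
    corr = np.bincount(owner, weights=region * chips, minlength=n_bits)
    return (corr > 0).astype(int)
```

In this sketch, calling `embed(mel, bits, key)` on an array of shape `(n_mels, n_frames)` returns a perturbed copy that would be handed to the vocoder, and `extract(mel, len(bits), key)` recovers the payload only when the same key is supplied; a different key produces an unrelated chip sequence and therefore essentially random bits. Spreading each bit over many cells is what lets a small per-cell perturbation survive additive noise.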

Problem

Research questions and friction points this paper is trying to address.

audio watermarking

AI-generated speech

provenance attribution

Mel-spectrogram

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio watermarking

Mel-spectrogram

TTS provenance

keyed spread-spectrum