🤖 AI Summary
This work addresses the lack of effective copyright protection and provenance mechanisms for AI-generated speech by proposing a plug-and-play watermarking method tailored to Mel-conditioned text-to-speech (TTS) architectures. The approach embeds a low-energy spread-spectrum watermark, controlled by a secret key, into intermediate acoustic representations in the Mel-spectrogram domain. By leveraging time-frequency region selection and integrating in a vocoder-agnostic manner, the method enables on-the-fly watermark embedding during speech generation without requiring modifications or retraining of the vocoder. It supports multi-user key allocation and secure verification. Experiments on DiffWave and HiFi-GAN demonstrate near-perfect watermark extraction accuracy, strong robustness against compression and noise, and consistently high perceptual audio quality.
📝 Abstract
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. MelShield operates in the Mel-spectrogram domain during generation, targeting the intermediate acoustic representations of Mel-conditioned text-to-speech (TTS) pipelines. The core idea is to treat the intermediate Mel-spectrogram as the host signal and, prior to waveform synthesis, embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions. Because watermarking is performed before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modifying or retraining the underlying vocoder (e.g., DiffWave or HiFi-GAN). Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions such as compression and additive noise, while preserving high perceptual audio quality.
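To make the keyed spread-spectrum idea concrete, the following is a minimal illustrative sketch and not the MelShield implementation. It assumes a log-Mel array of shape (bands, frames), derives one pseudo-random ±1 carrier per payload bit from a secret key, and recovers bits from the sign of the correlation with each carrier. The function names (`keyed_carriers`, `embed`, `extract`), the strength `alpha`, and the SHA-256 seeding are hypothetical choices made for illustration. For simplicity the sketch spreads each bit over the whole spectrogram rather than over selected time-frequency regions, and it extracts directly from the Mel array rather than from a vocoded waveform.

```python
import hashlib
import numpy as np


def keyed_carriers(key: str, n_bits: int, shape: tuple) -> np.ndarray:
    """Derive one pseudo-random +/-1 carrier per payload bit from a secret key."""
    # Hypothetical key schedule: hash the key and use 8 bytes as the PRNG seed.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n_bits, *shape))


def embed(mel: np.ndarray, bits, key: str, alpha: float = 0.05) -> np.ndarray:
    """Add a low-amplitude keyed perturbation that encodes `bits` into `mel`."""
    carriers = keyed_carriers(key, len(bits), mel.shape)
    signs = 2.0 * np.asarray(bits, dtype=float) - 1.0  # map {0, 1} -> {-1, +1}
    # Sum the signed carriers; each element moves by at most alpha * len(bits).
    return mel + alpha * np.tensordot(signs, carriers, axes=1)


def extract(mel_wm: np.ndarray, key: str, n_bits: int) -> np.ndarray:
    """Recover bits from the sign of the correlation with each keyed carrier."""
    carriers = keyed_carriers(key, n_bits, mel_wm.shape)
    centered = mel_wm - mel_wm.mean()  # remove the host's mean before correlating
    corr = np.tensordot(carriers, centered, axes=([1, 2], [0, 1]))
    return (corr > 0).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.normal(-5.0, 1.0, size=(80, 400))  # synthetic stand-in for a log-Mel
    payload = rng.integers(0, 2, size=16)
    watermarked = embed(mel, payload, key="user-42-secret")
    recovered = extract(watermarked, key="user-42-secret", n_bits=16)
    print("bit accuracy:", float(np.mean(recovered == payload)))
```

In this toy setup, detection works because each carrier spans tens of thousands of time-frequency cells, so its correlation with the embedded signal grows linearly with the number of cells while its correlation with the host grows only with the square root. A party without the key cannot regenerate the carriers, which is the property the abstract relies on for keyed verification.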