AI Summary
This work addresses the vulnerability of existing neural audio watermarking methods to deep learning-based attacks, particularly their lack of robustness against transformations that preserve linguistic content and speaker identity while altering acoustic characteristics. The study introduces, for the first time, self voice conversion into the watermarking attack paradigm, proposing a generalizable attack framework that uses a deep learning model to map speech into an acoustic space with modified features but unchanged speaker identity and semantics. Experiments show that this approach substantially degrades the extraction accuracy of multiple state-of-the-art watermarking systems, confirming its effectiveness and broad applicability. The findings expose critical security weaknesses in current neural audio watermarking schemes and pose a new challenge for the design of watermarking mechanisms robust to such attacks.
Abstract
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.