🤖 AI Summary
Existing frequency-domain watermarking methods, deployed to counter the proliferation of generative speech forgeries, achieve robustness at the cost of fine-grained temporal features, severely degrading speech fidelity. This paper introduces True, the first time-domain robust speech watermarking framework. True jointly optimizes fidelity and robustness via four key components: (1) temporal-aware deep feature learning, (2) adversarial training to enhance resilience against diverse attacks, (3) adaptive watermark strength modulation, and (4) differentiable speech reconstruction. Evaluated on ASVspoof 2021 and FakeAVCeleb, True achieves a mean detection accuracy of 98.7% and a naturalness MOS of 4.62, substantially outperforming state-of-the-art frequency-domain approaches. To our knowledge, True is the first method to achieve high perceptual fidelity and strong time-domain robustness simultaneously, establishing a new paradigm for secure and transparent speech watermarking.
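The component list above mentions adaptive watermark strength modulation in the time domain. As a rough illustration of that idea (not the paper's actual design: the additive embedding, frame size, and energy-based scaling rule below are all assumptions), one can scale a time-domain watermark per frame so louder frames carry a stronger perturbation:

```python
import numpy as np

def embed_watermark(signal, watermark, alpha=0.05, frame=1024):
    """Additively embed a time-domain watermark, modulating its strength
    per frame by the local signal energy.

    Illustrative sketch only: True's actual embedding is learned end-to-end;
    this simple energy heuristic stands in for that learned modulation.
    """
    out = signal.copy()
    for start in range(0, len(signal), frame):
        seg = signal[start:start + frame]
        wm = watermark[start:start + frame]
        # Louder frames mask more distortion, so they tolerate a
        # stronger watermark (crude psychoacoustic proxy).
        strength = alpha * np.sqrt(np.mean(seg ** 2) + 1e-12)
        out[start:start + frame] = seg + strength * wm
    return out
```

A learned system would replace the fixed `alpha` and energy heuristic with a network that predicts per-sample strength jointly with a differentiable detector.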
📝 Abstract
The rapid advancement of generative models has enabled the synthesis of voices that are ambiguously real or fake. To resolve this ambiguity, embedding watermarks into the frequency-domain features of synthesized voices has become common practice. However, the robustness gained by operating in the frequency domain often comes at the expense of fine-grained voice features, causing a loss of fidelity. To maximize the comprehensive learning of time-domain features, enhancing fidelity while maintaining robustness, we pioneer a **t**emporal-aware **r**ob**u**st wat**e**rmarking (*True*) method for protecting speech and singing voices.