SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation

📅 2025-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high embedding overhead and poor robustness to variable-length speech in copyright protection for speech generation models, this paper proposes SOLIDO, a lightweight generative watermarking method based on Low-Rank Adaptation (LoRA). It combines LoRA with speech diffusion models for parameter-efficient fine-tuning and introduces a watermark decoder built on depthwise separable convolution to extract watermarks accurately from variable-length speech. The method preserves high speech fidelity even at an embedding capacity of 2000 bps. Its maximum average extraction accuracy reaches 99.20% under individual attacks and 98.43% under compound attacks, and it outperforms other state-of-the-art methods by nearly 23% in resisting time-stretching attacks. The LoRA-based fine-tuning strategy also reduces computational overhead.

📝 Abstract

The accelerated advancement of speech generative models has given rise to security issues, including model infringement and unauthorized abuse of content. Although existing generative watermarking techniques have proposed corresponding solutions, most methods require substantial computational overhead and training costs. In addition, some methods have limitations in robustness when handling variable-length inputs. To tackle these challenges, we propose SOLIDO, a novel generative watermarking method that integrates parameter-efficient fine-tuning with speech watermarking through low-rank adaptation (LoRA) for speech diffusion models. Concretely, the watermark encoder converts the watermark to align with the input of diffusion models. To achieve precise watermark extraction from variable-length inputs, the watermark decoder based on depthwise separable convolution is designed for watermark recovery. To further enhance speech generation performance and watermark extraction capability, we propose a speech-driven lightweight fine-tuning strategy, which reduces computational overhead through LoRA. Comprehensive experiments demonstrate that the proposed method ensures high-fidelity watermarked speech even at a large capacity of 2000 bps. Furthermore, against common individual and compound speech attacks, our SOLIDO achieves a maximum average extraction accuracy of 99.20% and 98.43%, respectively. It surpasses other state-of-the-art methods by nearly 23% in resisting time-stretching attacks.
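The abstract credits LoRA with cutting the fine-tuning cost. As a rough illustration of why, the sketch below applies a generic low-rank update to one frozen weight matrix and compares parameter counts. It is not SOLIDO's implementation: the layer sizes, the rank, the scaling factor `alpha`, and all function names are assumptions chosen for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, down, up, alpha):
    """Generic LoRA layer: y = x W + (alpha / r) * (x A) B.

    w_frozen (d_in x d_out) stays fixed; only down (d_in x r) and
    up (r x d_out) would be trained. Names and shapes are illustrative.
    """
    rank = len(up)
    scale = alpha / rank
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, down), up)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def count_params(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. the LoRA factors."""
    return d_in * d_out, rank * (d_in + d_out)

rng = random.Random(0)
d_in, d_out, rank = 8, 6, 2  # hypothetical sizes, not from the paper
w_frozen = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
down = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(rank)]  # zero init: the update starts at zero
x = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]

y = lora_forward(x, w_frozen, down, up, alpha=4.0)
full, lora = count_params(d_in, d_out, rank)
```

With the `up` factor initialized to zero, the adapted layer starts out identical to the frozen one, and the number of trainable parameters grows with `rank * (d_in + d_out)` rather than `d_in * d_out`.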

Problem

Research questions and friction points this paper is trying to address.

Addresses security issues in speech generative models

Reduces computational overhead in watermarking methods

Improves robustness for variable-length speech inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-based watermarking for speech diffusion models

Depthwise separable convolution for watermark recovery

Speech-driven lightweight fine-tuning strategy
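To make the decoder idea concrete, here is a minimal sketch of a 1-D depthwise separable convolution followed by global average pooling, which is one common way to get a fixed-size output from inputs of any length. The source does not describe the decoder's layers, so the pooling step, the thresholding, and every name and weight below are assumptions for illustration only.

```python
def depthwise_conv1d(x, kernels):
    """Filter each channel with its own kernel (no padding).

    x: list of channels, each a list of T samples.
    kernels: one kernel (list of floats) per channel.
    """
    out = []
    for chan, k in zip(x, kernels):
        n = len(chan) - len(k) + 1
        out.append([sum(chan[t + i] * k[i] for i in range(len(k)))
                    for t in range(n)])
    return out

def pointwise_conv1d(x, weights):
    """1x1 convolution: weights[o][c] mixes channels at every time step."""
    steps = len(x[0])
    return [[sum(w[c] * x[c][t] for c in range(len(x))) for t in range(steps)]
            for w in weights]

def decode_bits(x, kernels, weights):
    """Hypothetical decoder head: conv features, time-average, threshold."""
    feats = pointwise_conv1d(depthwise_conv1d(x, kernels), weights)
    pooled = [sum(f) / len(f) for f in feats]  # length-independent summary
    return [1 if p > 0 else 0 for p in pooled]

def toy_input(length):
    """Two constant channels standing in for speech features."""
    return [[1.0] * length, [-1.0] * length]

kernels = [[0.5, 0.5], [0.5, 0.5]]            # one kernel per input channel
weights = [[1, 0], [0, 1], [1, 1], [2, 1]]    # four outputs, one per bit

bits_short = decode_bits(toy_input(10), kernels, weights)
bits_long = decode_bits(toy_input(25), kernels, weights)
```

Both calls return four bits regardless of the input length, because averaging over time removes the length dimension. The depthwise step uses `channels * kernel_size` weights and the pointwise step `channels * outputs`, fewer than the `channels * outputs * kernel_size` a standard convolution would need.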

🔎 Similar Papers

WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification