🤖 AI Summary
This work proposes SKeDA, a novel framework for robust watermarking in text-to-video generation that addresses the vulnerability of existing methods to temporal distortions such as frame misalignment, reordering, dropping, and compression. By embedding watermarks during the initial noise sampling stage of diffusion models, SKeDA employs a pseudorandom permutation key to enable frame-independent encryption. It further introduces a distribution-preserving sampling strategy that reformulates watermark extraction from sequential decoding to set-based aggregation, significantly enhancing robustness against frame reordering and loss. Additionally, a differential attention mechanism is designed to dynamically mitigate inter-frame distortions. Experimental results demonstrate that SKeDA achieves substantially higher watermark extraction accuracy and robustness under diverse perturbations while preserving high video generation quality.
📝 Abstract
The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA achieves strong watermark robustness while preserving high video generation quality.
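
The shuffle-key idea can be illustrated with a toy sketch. This is not the SKeDA implementation: it works on raw bit vectors rather than diffusion noise, and the function names, the replication of the message within each frame, and the consistency-based key matching are all assumptions introduced here for illustration. It shows only how deriving every frame key by permuting one base sequence lets an extractor treat the received frames as an unordered set.

```python
import numpy as np


def frame_keys(base_key, num_frames, shuffle_seed):
    """Derive one key per frame by permuting a single base key (hypothetical scheme)."""
    rng = np.random.default_rng(shuffle_seed)
    return [base_key[rng.permutation(base_key.size)] for _ in range(num_frames)]


def embed(message, keys, copies):
    """Repeat the message `copies` times and XOR it with each frame key."""
    payload = np.tile(message, copies)
    return [payload ^ key for key in keys]


def consistency(bits, copies):
    """Count how many bits agree with the per-position majority across the copies."""
    ones = bits.reshape(copies, -1).sum(axis=0, dtype=np.int64)
    return int(np.maximum(ones, copies - ones).sum())


def extract(received, keys, copies):
    """Set-level aggregation: no frame index is needed, only the set of keys."""
    msg_len = keys[0].size // copies
    votes = np.zeros(msg_len, dtype=np.int64)
    for cipher in received:
        # Try every key in the set; the right one yields near-identical copies.
        candidates = [cipher ^ key for key in keys]
        best = max(candidates, key=lambda bits: consistency(bits, copies))
        votes += best.reshape(copies, -1).sum(axis=0, dtype=np.int64)
    total = copies * len(received)
    return (2 * votes > total).astype(np.uint8)  # majority vote per message bit


rng = np.random.default_rng(0)
copies, num_frames = 8, 6
message = rng.integers(0, 2, size=16, dtype=np.uint8)
base_key = rng.integers(0, 2, size=message.size * copies, dtype=np.uint8)

keys = frame_keys(base_key, num_frames, shuffle_seed=1234)
frames = embed(message, keys, copies)

# Simulate temporal distortion: frames 1 and 3 are dropped, the rest reordered.
received = [frames[i] for i in (4, 0, 5, 2)]
recovered = extract(received, keys, copies)
print(np.array_equal(recovered, message))
```

Because the extractor searches over the set of permuted keys instead of assuming that the i-th received frame matches the i-th key, the result does not depend on frame order or on which frames survive. In this toy setting the cost is one trial decryption per key per frame.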