🤖 AI Summary
To address the lack of in-generation video watermarking and the quality degradation caused by post-hoc embedding, this paper proposes, for the first time, an implicit, adaptive watermarking paradigm embedded in the generation process of latent video diffusion models. Methodologically, the paper: (1) designs a Perturbation-Aware Suppression (PAS) mechanism that identifies and freezes perceptually sensitive layers to balance watermark robustness and visual fidelity; (2) introduces a lightweight temporal alignment module to ensure inter-frame consistency; and (3) embeds the watermark by partially fine-tuning the latent decoder rather than adding a separate post-generation embedding step. Experiments show that the method outperforms existing approaches in extraction accuracy, visual quality (PSNR/SSIM), and generation efficiency. It achieves over 92% watermark recovery under spatial and temporal attacks, including cropping and frame dropping, which supports its practical use for intellectual property protection and content traceability.
📝 Abstract
The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but has also raised serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to balance video quality against watermark extraction accuracy. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models that enables implicit and adaptive watermark integration during generation. Specifically, we partially fine-tune the latent decoder, with Perturbation-Aware Suppression (PAS) pre-identifying and freezing perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.
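The abstract does not spell out how PAS scores layers, so the following is only a minimal sketch of the general idea of "perturb each layer, measure the effect, freeze the most sensitive ones". Everything in it is an assumption made for illustration: the toy decoder (small weight matrices with tanh), Gaussian weight noise, output mean squared error as a stand-in for perceptual sensitivity, and the hypothetical names `layer_sensitivity`, `select_frozen`, and `freeze_ratio`. It is not the authors' implementation.

```python
import math
import random


def forward(layers, x):
    """Run a toy decoder: each layer is a weight matrix followed by tanh."""
    for w in layers:
        x = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return x


def perturb(w, sigma, rng):
    """Return a copy of weight matrix w with Gaussian noise added to every entry."""
    return [[wij + rng.gauss(0.0, sigma) for wij in row] for row in w]


def layer_sensitivity(layers, inputs, sigma=0.05, trials=8, seed=0):
    """Score each layer by how much noise in its weights changes the output.

    Assumption: output MSE against the clean output is used as a crude
    proxy for perceptual sensitivity.
    """
    rng = random.Random(seed)
    clean = [forward(layers, x) for x in inputs]
    scores = []
    for i in range(len(layers)):
        total = 0.0
        for _ in range(trials):
            # Perturb only layer i; all other layers stay unchanged.
            noisy = layers[:i] + [perturb(layers[i], sigma, rng)] + layers[i + 1:]
            for x, ref in zip(inputs, clean):
                out = forward(noisy, x)
                total += sum((a - b) ** 2 for a, b in zip(out, ref)) / len(ref)
        scores.append(total / (trials * len(inputs)))
    return scores


def select_frozen(scores, freeze_ratio=0.5):
    """Return the indices of the most sensitive layers, which would be frozen."""
    k = int(len(scores) * freeze_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical 4-layer decoder with 3x3 weights and 5 random latent inputs.
    layers = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
              for _ in range(4)]
    inputs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
    scores = layer_sensitivity(layers, inputs)
    frozen = select_frozen(scores, freeze_ratio=0.5)
    trainable = [i for i in range(len(layers)) if i not in frozen]
    print("sensitivity:", scores)
    print("frozen:", sorted(frozen), "trainable:", trainable)
```

In a real setting, the layers outside the frozen set would be the only ones updated during watermark fine-tuning, and a perceptual metric would likely replace the plain MSE used here.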