Video Signature: In-generation Watermarking for Latent Video Diffusion Models

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the absence of in-generation video watermarking and the quality degradation caused by post-hoc embedding, this paper proposes, for the first time, an implicit, adaptive watermarking paradigm embedded within the latent-space diffusion generation process. Methodologically: (1) a Perturbation-Aware Suppression (PAS) mechanism freezes perceptually sensitive decoder layers to balance watermark robustness against visual fidelity; (2) a lightweight Temporal Alignment module enforces inter-frame consistency; and (3) the implicit watermark encoder-decoder is optimized jointly with fine-tuning of the latent decoder. Experiments show the method outperforms existing approaches in extraction accuracy, PSNR/SSIM, and inference speed, achieving over 92% robust watermark recovery under spatiotemporal attacks such as cropping and frame dropping, which strengthens its practical utility for intellectual property protection and content traceability.
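The perturbation-based layer selection behind PAS can be illustrated with a minimal, framework-free sketch. Everything below (the toy decoder, the sensitivity score, the function names) is a hypothetical illustration of the general idea, not the paper's implementation: each layer's weights are perturbed in isolation, the resulting output distortion is measured, and the most sensitive layers are marked as frozen during watermark fine-tuning.

```python
import numpy as np

def decode(layers, z):
    """Toy latent decoder: a stack of linear layers with tanh activations."""
    h = z
    for W in layers:
        h = np.tanh(h @ W)
    return h

def layer_sensitivity(layers, z, eps=1e-2, seed=0):
    """Output distortion caused by perturbing each layer's weights in isolation."""
    rng = np.random.default_rng(seed)
    baseline = decode(layers, z)
    scores = []
    for i, W in enumerate(layers):
        noisy = [w.copy() for w in layers]
        noisy[i] = W + eps * rng.standard_normal(W.shape)
        scores.append(float(np.abs(decode(noisy, z) - baseline).mean()))
    return scores

def select_frozen(layers, z, k=1):
    """Indices of the k most perturbation-sensitive layers.

    These are the layers a PAS-style scheme would freeze so that
    watermark fine-tuning cannot degrade perceptual quality through them.
    """
    scores = layer_sensitivity(layers, z)
    return sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:k]
```

The remaining (unfrozen) layers would then be the only ones updated while training the decoder to embed the watermark.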

📝 Abstract
The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Protecting intellectual property in AI-generated video content
Balancing video quality and watermark extraction efficiency
Ensuring temporal consistency in watermarked video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-generation watermarking for latent diffusion models
Perturbation-Aware Suppression preserves visual quality
Temporal Alignment module enhances frame consistency
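As a rough illustration of what a lightweight temporal-alignment objective might look like (the summary above does not give the exact formulation, so the function below is a hypothetical sketch), one can penalize the watermarked decoder whenever its frame-to-frame changes drift from those of the original decoder:

```python
import numpy as np

def temporal_alignment_loss(frames, ref_frames):
    """Hypothetical temporal-consistency term: penalize deviation of the
    watermarked video's inter-frame differences from the reference
    decoder's inter-frame differences."""
    d_wm = np.diff(frames, axis=0)       # frame-to-frame change, watermarked
    d_ref = np.diff(ref_frames, axis=0)  # frame-to-frame change, reference
    return float(np.mean((d_wm - d_ref) ** 2))
```

Adding a term like this to the fine-tuning loss discourages the embedded watermark from introducing flicker between consecutive frames while still allowing per-frame changes that carry the signature.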
Yu Huang
The Hong Kong University of Science and Technology (Guangzhou)
Junhao Chen
The Hong Kong University of Science and Technology (Guangzhou)
Qi Zheng
The Hong Kong University of Science and Technology (Guangzhou)
Hanqian Li
M.Phil @HKUST(GZ)
Computer Vision, Large Language Model, Natural Language Processing
Shuliang Liu
PhD, HKUST(GZ)
Trustworthy LLM, VLM, Recommendation System
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing, Large Language Model