SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

πŸ“… 2026-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing watermarking methods for video diffusion models, which rely on non-blind extraction, require storing large key sets, involve costly template matching, and exhibit insufficient robustness to temporal perturbations under causal 3D VAE architectures. To overcome these challenges, we propose SIGMarkβ€”the first blind watermarking framework tailored for video diffusion models. SIGMark embeds watermarks by generating initial noise through global frame-level pseudo-random coding and introduces a Segmented Group Ordering (SGO) module specifically designed to align with causal 3D VAEs, thereby significantly enhancing robustness against spatiotemporal distortions. Experimental results demonstrate that SIGMark achieves high bit-wise extraction accuracy under diverse perturbations while substantially reducing storage and computational overhead, offering both scalability and practical applicability.

πŸ“ Abstract
Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks, which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they must maintain all message-key pairs and perform template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering (SGO) module tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.
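The blind-extraction idea can be illustrated with a toy sketch. This is a simplification and NOT the paper's actual GF-PRC construction: here a single global key deterministically derives a per-frame sign pattern, message bits are embedded in the signs of the initial Gaussian noise while magnitudes stay half-normal (so samples remain N(0, 1)-distributed), and extraction needs only the shared key rather than a stored template per message. The function names and the SHA-256 key derivation are illustrative assumptions.

```python
# Toy sketch of blind keyed-noise watermarking (not the paper's GF-PRC scheme).
import hashlib
import random

def keyed_signs(key: bytes, frame: int, n: int) -> list:
    # Derive a deterministic +/-1 pattern per frame from one global key.
    digest = hashlib.sha256(key + frame.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [1 if rng.random() < 0.5 else -1 for _ in range(n)]

def embed(key: bytes, bits: list, frame: int, n_per_bit: int = 64) -> list:
    # Each message bit keeps (0) or flips (1) the keyed sign pattern for its
    # chunk; magnitudes are half-normal, so samples remain N(0, 1)-distributed.
    rng = random.Random()
    signs = keyed_signs(key, frame, len(bits) * n_per_bit)
    noise = []
    for i, b in enumerate(bits):
        flip = -1 if b else 1
        for j in range(n_per_bit):
            mag = abs(rng.gauss(0.0, 1.0))
            noise.append(flip * signs[i * n_per_bit + j] * mag)
    return noise

def extract(key: bytes, noise: list, frame: int, n_bits: int,
            n_per_bit: int = 64) -> list:
    # Blind extraction: correlate recovered noise signs with the keyed
    # pattern; no per-message template or key set is stored.
    signs = keyed_signs(key, frame, n_bits * n_per_bit)
    bits = []
    for i in range(n_bits):
        corr = sum((1 if noise[i * n_per_bit + j] >= 0 else -1)
                   * signs[i * n_per_bit + j] for j in range(n_per_bit))
        bits.append(1 if corr < 0 else 0)
    return bits

key = b"global-key"          # one shared key for all messages
msg = [1, 0, 1, 1, 0, 0, 1, 0]
z = embed(key, msg, frame=0)
assert extract(key, z, frame=0, n_bits=len(msg)) == msg
```

Because each bit is spread over many samples and read out by a majority-style correlation, moderate perturbation of the recovered noise tends to leave the extracted bits intact; the paper's GF-PRC keys and SGO module address the harder problems of key-storage scalability and temporal robustness under causal 3D VAEs.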
Problem

Research questions and friction points this paper is trying to address.

in-generation watermarking
blind extraction
video diffusion models
temporal robustness
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

blind extraction
in-generation watermarking
video diffusion models
causal 3D VAEs
scalable watermarking
πŸ”Ž Similar Papers
No similar papers found.
Xinjie Zhu, Lenovo Research, Beijing, China
Zijing Zhao, Lenovo Research
Hui Jin, Lenovo Research, Beijing, China
Qingxiao Guo, Lenovo Research, Beijing, China
Yilong Ma, Lenovo Research, Beijing, China
Yunhao Wang, Lenovo Research, Beijing, China
Xiaobing Guo, Lenovo Research, Beijing, China
Weifeng Zhang, Corp VP & Head of Intelligent Computing Lab at Lenovo Research
AI HW/SW Co-Design, Computer Architecture, Heterogeneous Computing, AI/ML, GPU Optimizations