Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

📅 2025-05-19
🤖 AI Summary
To address the lack of invisible watermarking techniques for copyright protection of generated video, this paper proposes what it describes as the first end-to-end watermark embedding method integrated into text-to-video generation frameworks. Methodologically, it introduces (1) a hierarchical, coarse-to-fine patch-matching mechanism driven by visual similarity, which adaptively aligns watermark patches with video frames and then with spatial regions inside them; and (2) what the authors describe as the first application of state-space models (Mamba) to watermarking, combined with 3D wavelet transforms and spatiotemporal local scanning to improve robustness and fidelity. Experiments report state-of-the-art results in video quality (PSNR/SSIM), watermark imperceptibility (LPIPS), extraction accuracy (>98.5%), and resilience against compression, cropping, and re-encoding attacks, outperforming naive adaptations of image watermarking methods. The approach is also described as efficient to train and deploy.

📝 Abstract
The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. We will release our code upon publication.
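
The coarse-to-fine matching described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the similarity measures (difference of mean intensity for the frame choice, mean absolute difference for the region search), the grayscale nested-list image format, and all function names are assumptions chosen to keep the example short.

```python
# Illustrative sketch only. The similarity measures and every name below are
# assumptions for demonstration, not the method used in Safe-Sora.

def split_into_patches(image, patch):
    """Split a 2D grayscale image (list of rows) into non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    tiles = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def mean(tile):
    """Average pixel value of a 2D tile or frame."""
    return sum(sum(row) for row in tile) / (len(tile) * len(tile[0]))


def region_distance(tile, frame, top, left):
    """Mean absolute difference between a tile and the frame region at (top, left)."""
    p = len(tile)
    total = 0.0
    for i in range(p):
        for j in range(p):
            total += abs(tile[i][j] - frame[top + i][left + j])
    return total / (p * p)


def match_patch(tile, frames):
    """Coarse step: pick the frame whose mean intensity is closest to the tile's.
    Fine step: slide the tile over that frame and keep the lowest-distance position."""
    tile_mean = mean(tile)
    frame_idx = min(range(len(frames)), key=lambda t: abs(mean(frames[t]) - tile_mean))
    frame = frames[frame_idx]
    p = len(tile)
    positions = [(top, left)
                 for top in range(len(frame) - p + 1)
                 for left in range(len(frame[0]) - p + 1)]
    top, left = min(positions, key=lambda pos: region_distance(tile, frame, *pos))
    return frame_idx, top, left
```

Calling `match_patch(tile, frames)` for each tile returned by `split_into_patches` yields a `(frame index, top, left)` placement per watermark patch. A learned or perceptual similarity measure could replace both distance functions without changing the two-stage structure.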
Problem

Research questions and friction points this paper is trying to address.

Ensuring copyright protection in AI-generated videos
Integrating invisible watermarks during video synthesis
Improving watermark robustness and visual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical coarse-to-fine adaptive matching mechanism
3D wavelet transform-enhanced Mamba architecture
Spatiotemporal local scanning strategy
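
State-space models such as Mamba process a one-dimensional sequence, so video features have to be serialized before they can be modeled. The sketch below shows one plausible way to do this with a local scan that visits every position inside a small spatiotemporal window before moving to the next window, which keeps neighboring positions close together in the sequence. The window sizes, the visiting order, and the function names are assumptions; the summary does not specify the paper's actual scan.

```python
# Illustrative sketch only. Window sizes and visiting order are assumptions;
# the paper's actual scanning strategy is not specified in this summary.

def local_scan_order(T, H, W, wt=2, wh=2, ww=2):
    """Return (t, y, x) coordinates so that all positions inside one
    wt x wh x ww spatiotemporal window are visited before the next window."""
    order = []
    for t0 in range(0, T, wt):              # iterate over windows
        for y0 in range(0, H, wh):
            for x0 in range(0, W, ww):
                for t in range(t0, min(t0 + wt, T)):      # iterate inside a window
                    for y in range(y0, min(y0 + wh, H)):
                        for x in range(x0, min(x0 + ww, W)):
                            order.append((t, y, x))
    return order


def flatten_video(video, order):
    """Serialize a video indexed as video[t][y][x] into a 1D list following `order`."""
    return [video[t][y][x] for t, y, x in order]
```

With a plain raster scan, two pixels that are adjacent in time would be separated by a full frame of tokens; with the local scan they are at most one window apart.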
Zihan Su
Tsinghua University
Xuerui Qiu
Institute of Automation, Chinese Academy of Sciences
Representation Learning, 3D Computer Vision, Model Compression
Hongbin Xu
South China University of Technology
Tangyu Jiang
Tsinghua University
Junhao Zhuang
Tsinghua University
Image editing, video signal processing
Chun Yuan
Tsinghua University
Ming Li
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Shengfeng He
Singapore Management University
Visual Computing, Generative Models, Computer Vision, Computational Photography, Computer Graphics
Fei Richard Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)