WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image watermarking methods exhibit poor temporal robustness and often fail to detect watermarks across frames in image-to-video (I2V) generation due to inter-frame misalignment and dynamic distortions. To address this, we propose the first temporally robust watermarking framework designed specifically for I2V scenarios. Our method integrates optical-flow-guided training, a temporal consistency loss (TCL), and semantic preservation constraints, coupled with an instruction-enhanced FUSE encoder-decoder architecture and optical-flow-based deformation alignment. Additionally, we introduce a video diffusion surrogate model to assist optimization. The framework substantially enhances watermark stability throughout the I2V generation pipeline and enables reliable cross-frame watermark recovery. On mainstream I2V models, it achieves an average 18.7% improvement in bit accuracy on both the first frame and subsequent frames, and it remains robust against diverse pre- and post-generation distortions.
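The bit-accuracy metric quoted above is the standard one for watermark decoding: the fraction of payload bits the decoder recovers correctly, averaged over frames to measure temporal robustness. A minimal sketch of how it would be computed (function names are illustrative, not from the paper):

```python
import numpy as np

def bit_accuracy(decoded_bits, true_bits):
    """Fraction of watermark payload bits recovered correctly in one frame."""
    decoded_bits = np.asarray(decoded_bits)
    true_bits = np.asarray(true_bits)
    return float((decoded_bits == true_bits).mean())

def per_frame_bit_accuracy(frame_decodes, true_bits):
    """Average bit accuracy over all frames of a generated video."""
    return float(np.mean([bit_accuracy(d, true_bits) for d in frame_decodes]))
```

A decoder recovering 3 of 4 payload bits in a frame scores 0.75; the paper's 18.7% figure refers to the improvement in this metric over baselines.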

📝 Abstract
Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
Problem

Research questions and friction points this paper is trying to address.

Enhances watermark robustness in image-to-video conversion
Addresses per-frame detection weakening in generated videos
Ensures cross-modal watermark recovery under realistic distortions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-guided synthesis engine for realistic distortions
Optical-flow warping with temporal consistency loss
Semantic preservation loss maintains the conditioning signal
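The second bullet is the core temporal mechanism: predictions for frame t are warped to frame t+1 along the optical flow, and disagreement between the warped and actual predictions is penalized. A minimal numpy sketch of such a loss, assuming dense backward flow fields and using nearest-neighbor warping for simplicity (the paper's exact formulation may differ):

```python
import numpy as np

def warp_with_flow(pred, flow):
    """Warp a per-pixel prediction map from frame t toward frame t+1
    using a dense optical flow field (nearest-neighbor backward sampling).
    pred: (H, W) array; flow: (H, W, 2) array of (dx, dy) offsets."""
    H, W = pred.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Backward warping: each pixel in frame t+1 samples frame t at the
    # location the flow says it came from, clamped to the image bounds.
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    return pred[src_y, src_x]

def temporal_consistency_loss(preds, flows):
    """Mean absolute difference between each frame's watermark prediction
    and the flow-warped prediction of the previous frame.
    preds: list of (H, W) maps; flows: list of (H, W, 2) flows t -> t+1."""
    terms = [np.abs(preds[t + 1] - warp_with_flow(preds[t], flows[t])).mean()
             for t in range(len(preds) - 1)]
    return float(np.mean(terms))
```

Under zero flow this reduces to penalizing any frame-to-frame drift in the decoder's per-pixel predictions, which is what stabilizes per-frame watermark recovery.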
Utae Jeong, Korea University
Sumin In, Korea University
Hyunju Ryu, Korea University
Jaewan Choi, Korea University
Feng Yang, Google DeepMind
Jongheon Jeong, Korea University
Seungryong Kim, Associate Professor, KAIST (Computer Vision, Machine Learning)
Sangpil Kim, Korea University (Computer Vision)