🤖 AI Summary
Existing image watermarking methods exhibit poor temporal robustness and fail to detect watermarks across frames in image-to-video (I2V) generation due to inter-frame misalignment and dynamic distortions. To address this, we propose the first temporally robust watermarking framework specifically designed for I2V scenarios. Our method integrates optical-flow-guided training, a temporal consistency loss (TCL), and semantic preservation constraints, coupled with an instruction-enhanced FUSE encoder-decoder architecture and optical-flow-based deformation alignment. Additionally, we introduce a video diffusion surrogate model to assist optimization. The framework significantly enhances watermark stability throughout the entire I2V generation pipeline and enables reliable cross-frame watermark recovery. On mainstream I2V models, it achieves an average 18.7% improvement in bit accuracy for both the first frame and all subsequent frames, and it demonstrates strong robustness against diverse pre- and post-generation distortions.
📝 Abstract
Image watermarking supports authenticity and provenance, yet many schemes remain easy to bypass with common distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V) generation, where per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy, and resilience when various distortions are applied before or after video generation.
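The idea behind optical-flow warping with a temporal consistency loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the integer-flow warp, function names, and the MSE penalty are assumptions (practical systems typically use bilinear sampling, e.g. `grid_sample`, and learned flow).

```python
import numpy as np

def warp_with_flow(pred, flow):
    """Warp a per-pixel prediction map toward the next frame using
    integer optical flow (dy, dx per pixel). Hypothetical helper for
    illustration; real pipelines use sub-pixel bilinear warping."""
    H, W = pred.shape
    warped = np.zeros_like(pred)
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y, x]
            # Sample the previous frame at the flow-displaced location,
            # clamping to the image border.
            ys = int(np.clip(y + dy, 0, H - 1))
            xs = int(np.clip(x + dx, 0, W - 1))
            warped[y, x] = pred[ys, xs]
    return warped

def temporal_consistency_loss(preds, flows):
    """Mean squared difference between each frame's watermark prediction
    and the flow-aligned prediction from the previous frame. A low value
    means per-frame predictions are temporally stable."""
    losses = []
    for t in range(1, len(preds)):
        aligned = warp_with_flow(preds[t - 1], flows[t - 1])
        losses.append(np.mean((preds[t] - aligned) ** 2))
    return float(np.mean(losses))
```

With identical per-frame predictions and zero flow the loss is zero; any misalignment between a frame's prediction and the warped previous prediction raises it, which is the signal used to stabilize the decoder across frames.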