🤖 AI Summary
Existing invisible watermarking methods face three critical bottlenecks: (i) proxy perceptual losses (e.g., MSE, LPIPS) are misaligned with human visual perception and introduce visible artifacts; (ii) conflicting multi-objective optimization causes training instability and heavy reliance on manual hyperparameter tuning; and (iii) robustness and imperceptibility degrade markedly on high-resolution images and videos. This paper proposes the first purely adversarial training paradigm for invisible watermarking, featuring a three-stage decoupled optimization strategy, a JND-guided high-resolution adaptation mechanism, and training-time up-sampling simulation. Key technical innovations include JND-aware perceptual modeling, temporal watermark pooling, and multi-stage scheduling. Experiments demonstrate superior robustness against diverse attacks compared with state-of-the-art methods; both objective metrics and subjective evaluations confirm imperceptibility; and the method scales efficiently to HD video. The work bridges theoretical innovation with practical applicability.
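To make the JND-guided attenuation idea concrete, here is a minimal sketch of how a just-noticeable-difference map could gate a watermark residual per pixel. The toy Weber-style luminance threshold and the function names (`luminance_jnd`, `attenuate_residual`) are illustrative assumptions, not the paper's actual perceptual model:

```python
import numpy as np

def luminance_jnd(gray):
    """Toy JND map based on luminance masking: pixels far from
    mid-gray tolerate larger distortions. (Illustrative curve,
    not the paper's actual JND model.)"""
    # gray in [0, 1]; tolerance grows away from mid-luminance 0.5
    return 0.01 + 0.04 * np.abs(gray - 0.5) * 2.0

def attenuate_residual(cover, residual):
    """Scale the watermark residual so its per-pixel magnitude
    stays below the local JND threshold of the cover image."""
    gray = cover.mean(axis=-1, keepdims=True)      # crude luminance proxy
    jnd = luminance_jnd(gray)                      # per-pixel tolerance
    scale = np.minimum(1.0, jnd / (np.abs(residual) + 1e-8))
    return residual * scale

cover = np.random.rand(64, 64, 3)
residual = 0.1 * (np.random.rand(64, 64, 3) - 0.5)
wm = attenuate_residual(cover, residual)
```

The same gating can be applied at full resolution after upsampling a low-resolution watermark, which is where a training-time simulation of the inference pipeline helps suppress upscaling artifacts.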
📝 Abstract
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
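The temporal watermark pooling mentioned above can be illustrated with a minimal sketch: per-frame decoder outputs are averaged over time so that frame-level noise cancels before the bit decision. The logit-averaging scheme and the function name `temporal_watermark_pool` are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def temporal_watermark_pool(frame_logits):
    """Aggregate per-frame decoder logits (num_frames, num_bits)
    into one video-level message by averaging over time, then
    taking hard bit decisions. (Illustrative stand-in.)"""
    pooled = np.mean(frame_logits, axis=0)   # (num_bits,)
    return (pooled > 0).astype(np.uint8)     # sign -> bit

rng = np.random.default_rng(0)
message = rng.integers(0, 2, size=32)
signs = 2.0 * message - 1.0                  # bits {0,1} -> logits {-1,+1}
# simulate noisy per-frame logits for a 16-frame clip
frame_logits = signs + 1.5 * rng.standard_normal((16, 32))
decoded = temporal_watermark_pool(frame_logits)
bit_accuracy = (decoded == message).mean()
```

Averaging over 16 frames shrinks the effective noise by a factor of four, so the pooled decision recovers the message far more reliably than any single frame would.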