Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work exposes a fundamental security flaw in semantic watermarking for latent diffusion models: adversaries can forge or remove watermarks using unrelated models, even across different architectures and latent spaces. Two black-box attacks are proposed: (1) watermark injection into (or removal from) real images via latent-space alignment, and (2) synthesis of new images carrying a target watermark via reverse sampling and prompt-conditioned regeneration. Both attacks require only a single reference watermarked image. Technically, the approach combines latent-space distance optimization, transfer between heterogeneous architectures (UNet and DiT), and target-watermark-guided conditional generation. Evaluated against mainstream schemes including Tree-Rings and Gaussian Shading, the attacks achieve success rates above 92% with imperceptible visual degradation, providing the first empirical demonstration of systemic security vulnerabilities in semantic watermarking for diffusion models.

📝 Abstract
Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.
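The first attack described above (imprinting a watermark via latent-space alignment) is, at its core, a small optimization problem: nudge an arbitrary image so that its latent representation in an attacker-controlled model approaches the latent of a watermarked reference, while a pixel-space penalty keeps the change imperceptible. A minimal sketch, using a toy convolutional encoder as a stand-in for the attacker's unrelated VAE; the function name, loss weights, and step counts are illustrative assumptions, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the attacker's (unrelated) VAE encoder; a real attack
# would use the encoder of any latent diffusion model the attacker controls.
torch.manual_seed(0)
encoder = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)

def imprint_watermark(cover, watermarked, steps=200, lr=0.05, lam=0.1):
    """Nudge `cover` so its latent approaches the watermarked image's latent,
    while a pixel-space penalty keeps the perturbation visually small."""
    target_latent = encoder(watermarked).detach()
    adv = cover.clone().requires_grad_(True)
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (F.mse_loss(encoder(adv), target_latent)
                + lam * F.mse_loss(adv, cover))  # alignment + imperceptibility
        loss.backward()
        opt.step()
    return adv.detach()

cover = torch.rand(1, 3, 64, 64)    # arbitrary real image
marked = torch.rand(1, 3, 64, 64)   # single watermarked reference
forged = imprint_watermark(cover, marked)
```

Running the same optimization with the target latent taken from a clean image instead would implement the removal variant the abstract mentions.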
Problem

Research questions and friction points this paper is trying to address.

Exposing security flaws in semantic watermarks for diffusion models
Demonstrating forgery attacks using unrelated models and architectures
Enabling watermark removal and forgery with minimal reference images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage unrelated models for forgery attacks
Manipulate latent representations for watermark injection and removal
Invert watermarked images and regenerate them with arbitrary prompts
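The inversion-based attack in the list above can be sketched with deterministic DDIM updates: run the update in reverse to map a watermarked image back to watermark-carrying noise, then sample forward again under a new prompt. This is a minimal sketch with a toy, input-independent noise predictor (so inversion is exact); in a real attack `eps` would be any available LDM's UNet or DiT and `prompt` a text embedding rather than a scalar — all names here are illustrative:

```python
import torch

torch.manual_seed(0)
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
base = torch.randn(8)  # toy direction standing in for predicted noise

def eps(x, t, prompt):
    # Toy noise predictor: constant in x, scaled by a scalar "prompt".
    # A real attacker would call an unrelated LDM's UNet/DiT here.
    return prompt * base

def ddim_step(x, t_from, t_to, prompt):
    # Deterministic DDIM update between two timesteps (either direction).
    a_f, a_t = alphas_bar[t_from], alphas_bar[t_to]
    e = eps(x, t_from, prompt)
    x0_hat = (x - (1 - a_f).sqrt() * e) / a_f.sqrt()
    return a_t.sqrt() * x0_hat + (1 - a_t).sqrt() * e

def invert(x0, prompt):
    # Reverse sampling: watermarked image -> watermark-carrying noise.
    x = x0
    for t in range(T - 1):
        x = ddim_step(x, t, t + 1, prompt)
    return x

def regenerate(xT, prompt):
    # Re-generate from the recovered noise, optionally with a new prompt.
    x = xT
    for t in range(T - 1, 0, -1):
        x = ddim_step(x, t, t - 1, prompt)
    return x

marked_image = torch.randn(8)           # stands in for the reference image
noise = invert(marked_image, prompt=1.0)
new_image = regenerate(noise, prompt=0.3)  # new content, same recovered noise
```

The key property the sketch illustrates is that the recovered noise `noise` (which carries the semantic watermark in schemes like Tree-Rings) can seed generation of entirely different content.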