When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting

📅 2025-07-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the robustness deficiency of invisible watermarks in Stable Diffusion models under decoder-unavailable ("no-box") conditions, constituting the first systematic security evaluation of watermarking in a setting where the attacker has access to neither the model internals nor the ground-truth watermark decoder. The authors propose three watermark-removal attacks: (1) structure-guided perturbation leveraging edge prediction, (2) box blurring, and (3) lightweight model fine-tuning. Ablation studies identify critical factors governing attack efficacy: message length, convolutional kernel size, and decoder depth. The strongest attack reduces watermark detection accuracy to approximately 47.92%, and even advanced defenses such as multi-label smoothing fail to keep extraction accuracy at an acceptable level. These findings expose fundamental vulnerabilities of current generative-model watermarking schemes in realistic adversarial settings, providing both a rigorous empirical benchmark and design guidance for developing more robust watermarking mechanisms.

📝 Abstract
Watermarking has emerged as a promising solution to counter harmful or deceptive AI-generated content by embedding hidden identifiers that trace content origins. However, the robustness of current watermarking techniques is still largely unexplored, raising critical questions about their effectiveness against adversarial attacks. To address this gap, we examine the robustness of model-specific watermarking, where watermark embedding is integrated with text-to-image generation in models like latent diffusion models. We introduce three attack strategies: edge prediction-based, box blurring, and fine-tuning-based attacks in a no-box setting, where an attacker does not require access to the ground-truth watermark decoder. Our findings reveal that while model-specific watermarking is resilient against basic evasion attempts, such as edge prediction, it is notably vulnerable to blurring and fine-tuning-based attacks. Our best-performing attack achieves a reduction in watermark detection accuracy to approximately 47.92%. Additionally, we perform an ablation study on factors like message length, kernel size and decoder depth, identifying critical parameters influencing the fine-tuning attack's success. Finally, we assess several advanced watermarking defenses, finding that even the most robust methods, such as multi-label smoothing, result in watermark extraction accuracy that falls below an acceptable level when subjected to our no-box attacks.
Problem

Research questions and friction points this paper is trying to address.

How robust is model-specific watermarking in latent diffusion models against adversarial attacks?
Can watermarks be removed in a no-box setting, without access to the ground-truth decoder?
How vulnerable are current watermarking schemes to blurring and fine-tuning attacks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Edge prediction-based attack operating in a no-box setting
Box blurring attack that degrades watermark detection accuracy
Fine-tuning-based attack that evades even advanced watermark defenses
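Of the three attacks, box blurring is the simplest to illustrate: a uniform averaging filter suppresses the high-frequency detail that pixel-domain watermark signals typically occupy. The sketch below is not the paper's implementation; it is a minimal, self-contained box blur in NumPy (the function name `box_blur` and reflective padding are my own choices) showing the operation the attack applies, with `kernel_size` corresponding to the kernel-size factor studied in the ablation.

```python
import numpy as np

def box_blur(image: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Apply a box blur (uniform averaging filter) to a 2-D image.

    Reflective padding keeps the output the same shape as the input.
    Larger kernel_size removes more high-frequency content, which is
    where fragile watermark signals tend to live.
    """
    pad = kernel_size // 2
    padded = np.pad(image.astype(np.float64), pad, mode="reflect")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum each shifted window, then divide once: equivalent to
    # convolving with a kernel_size x kernel_size uniform kernel.
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (kernel_size ** 2)
```

Applying this per channel to a watermarked generation attenuates an additive high-frequency watermark while leaving low-frequency image content largely intact, which is consistent with the paper's finding that blurring attacks are effective against model-specific watermarking.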