🤖 AI Summary
To address copyright infringement risks in multimodal large models, this paper targets the weaknesses of existing watermarking methods, which rely on model retraining or backdoor-induced anomalous output distributions and therefore suffer from poor stealth and vulnerability to detection or forgery. The proposed AGATE is the first black-box watermarking framework that preserves model integrity without modifying parameters. AGATE embeds imperceptible watermarks at the input level via a novel adversarial trigger generation module and a post-hoc semantic correction mechanism, integrating adversarial example generation, cross-modal embedding alignment, and black-box post-processing transformations. A dual-path verification mechanism ensures robustness and unforgeability. Extensive experiments on five benchmark datasets for image-text retrieval and image classification demonstrate state-of-the-art performance. Moreover, AGATE maintains high watermark detection rates under two strong adversarial attacks (input perturbation and model fine-tuning), validating its practical viability and resilience.
📝 Abstract
Recent advances in large-scale Artificial Intelligence (AI) models offering multimodal services have made them foundational in AI systems, and thus prime targets for model theft. Existing methods select Out-of-Distribution (OoD) data as backdoor watermarks and retrain the original model for copyright protection. However, these methods are susceptible to malicious detection and forgery by adversaries, resulting in watermark evasion. In this work, we propose the Model-agnostic Black-box Backdoor Watermarking Framework (AGATE) to address the stealthiness and robustness challenges in multimodal model copyright protection. Specifically, we propose an adversarial trigger generation method that produces stealthy adversarial triggers from an ordinary dataset, preserving visual fidelity while inducing semantic shifts. To alleviate anomaly detection of model outputs, we propose a post-transform module that corrects the model output by narrowing the distance between the adversarial trigger's image embedding and its text embedding. A two-phase watermark verification then judges whether a suspect model infringes by comparing its outputs with and without the transform module. Consequently, AGATE consistently outperforms state-of-the-art methods across five datasets on the downstream tasks of multimodal image-text retrieval and image classification. Additionally, we validate the robustness of AGATE under two adversarial attack scenarios.
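The verification idea described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual method: the names `post_transform`, `verify_watermark`, the linear interpolation with weight `alpha`, and the gap threshold `tau` are all assumptions for illustration (the real AGATE transform and verification operate in a black-box setting on a deployed multimodal model). The sketch only shows the two-phase logic: on trigger inputs, image-text similarity jumps once the correcting transform is applied, while benign inputs show no gap.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def post_transform(img_emb, txt_emb, alpha=0.5):
    """Hypothetical post-transform: pull the trigger's image embedding
    toward its paired text embedding (alpha is an illustrative weight,
    not a parameter from the paper)."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(img_emb, txt_emb)]

def verify_watermark(img_emb, txt_emb, tau=0.2):
    """Two-phase check (sketch): compare image-text similarity with and
    without the transform; a large gap on trigger inputs signals that the
    suspect model carries the watermark behavior."""
    raw = cosine_sim(img_emb, txt_emb)
    corrected = cosine_sim(post_transform(img_emb, txt_emb), txt_emb)
    return (corrected - raw) > tau

# Toy embeddings: the trigger's image embedding has semantically shifted
# away from its caption embedding, so the transform closes a large gap.
txt = [1.0, 0.0, 0.0, 0.0]
trigger = [0.2, 0.98, 0.0, 0.0]
benign = [1.0, 0.0, 0.0, 0.0]
print(verify_watermark(trigger, txt))  # True: large raw-vs-corrected gap
print(verify_watermark(benign, txt))   # False: no gap on benign input
```

The gap-based decision is what lets verification remain black-box: only model outputs with and without the transform are compared, and no model parameters are inspected.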