🤖 AI Summary
To address the challenges of auditing soft prompt copyright in vision-language models, namely difficulty of detection, high false-positive rates, and the functional degradation caused by existing methods, this paper proposes the first sequential watermarking framework designed specifically for soft prompts. Our approach embeds watermarks into the ranking order of out-of-distribution classes, deliberately positioned away from the primary task's decision space; watermark encoding and verification leverage CLIP's zero-shot classification capability while leaving the primary task's predictions unchanged. We further design a lightweight, hypothesis-test-guided verification protocol to ensure audit reliability. Evaluated on 11 benchmark datasets, our method fully preserves original model performance (i.e., it is harmless), substantially reduces false positives, and remains robust against adaptive attacks such as pruning, fine-tuning, and knowledge distillation. To the best of our knowledge, this is the first soft prompt watermarking scheme to achieve high-accuracy, low-interference, attack-resilient copyright authentication.
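To make the ranking-based mechanism concrete, here is a minimal sketch of how verification could look, assuming OpenAI's `clip` package. The OOD class names, the `SECRET_ORDER` permutation, and the Kendall's-tau test statistic are all illustrative stand-ins, not the paper's exact protocol; for brevity the sketch also queries CLIP with hand-written `"a photo of a ..."` text prompts, whereas in the actual setting the learned soft prompt under audit would supply the context.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from scipy import stats

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Defender-specified OOD classes and a secret target ordering over them
# (both hypothetical; more classes make a chance match far less likely).
OOD_CLASSES = ["abacus", "gondola", "stethoscope", "windmill", "tambourine"]
SECRET_ORDER = [3, 0, 4, 1, 2]  # class indices, best-ranked first

@torch.no_grad()
def ood_ranking(image_path: str) -> list[int]:
    """Rank the OOD classes by CLIP image-text similarity, best first."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([f"a photo of a {c}" for c in OOD_CLASSES]).to(device)
    img = model.encode_image(image)
    txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)  # one similarity score per OOD class
    return sims.argsort(descending=True).tolist()

def positions(order: list[int]) -> list[int]:
    """Map an ordering (class at rank i) to rank positions (rank of class c)."""
    pos = [0] * len(order)
    for rank, cls in enumerate(order):
        pos[cls] = rank
    return pos

def verify(image_path: str, alpha: float = 0.05) -> bool:
    """Flag the suspect prompt if the observed OOD ranking agrees with the
    secret order beyond chance (Kendall's tau as a stand-in statistic)."""
    tau, p_value = stats.kendalltau(positions(ood_ranking(image_path)),
                                    positions(SECRET_ORDER))
    return tau > 0 and p_value < alpha
```

The design point this sketch illustrates is that the watermark never touches the top-1 in-distribution label: it lives entirely in the relative ordering of classes the primary task never predicts, which is why embedding it does not degrade task accuracy.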
📝 Abstract
Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to the unique characteristics of prompt learning. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity. We find that these failures of intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. Because it is embedded in this more complex space, the watermark keeps the original prediction label unchanged and therefore conflicts far less with the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide theoretical analyses of its success conditions. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
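One way to see why a hypothesis-test-guided protocol can be both lightweight and reliable (an illustrative null model of our own, not necessarily the paper's exact test statistic): under the null hypothesis that the suspect model is independent of the watermark, its ranking of the K defender-specified OOD classes carries no information about the secret order.

```latex
% Illustrative null model (assumed): under H_0 the suspect prompt is
% independent of the watermark, so its induced ranking \hat{\pi} over the
% K OOD classes is uniform over all K! orderings.
\Pr\bigl[\hat{\pi} = \pi^{*} \,\big|\, H_0\bigr] = \frac{1}{K!}
\quad\Longrightarrow\quad
\text{an exact match rejects } H_0 \text{ at level } \alpha
\text{ whenever } K! \ge 1/\alpha .
```

Under this reading, even a handful of classes suffices: with K = 6, a single exact match of the secret order has chance probability 1/720 ≈ 0.0014, so very few verification queries are needed to keep false positives low.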