🤖 AI Summary
This paper addresses the vulnerability of large language model (LLM) system prompts to unauthorized extraction and misuse, and the limited practicality of existing copyright protection mechanisms, which rely on intermediate model outputs such as logits. To this end, we propose PromptCOS, the first system prompt copyright auditing framework that relies solely on model outputs. Our method models content-level output similarity and integrates three complementary techniques: cyclic output signal modulation, auxiliary token embedding watermarking, and joint optimization of a verification query, with cover tokens protecting the watermark from deletion. Together these enable robust watermark embedding and verification without access to internal logits. Experiments demonstrate that PromptCOS achieves an average watermark similarity of 99.3%, improves distinctiveness over the best baseline by 60.8%, incurs at most 0.58% accuracy degradation, and reduces computational cost by up to 98.1%. The framework thus offers high identifiability, strong tamper resistance, and practical deployability.
📝 Abstract
The rapid progress of large language models (LLMs) has greatly improved performance on reasoning tasks and facilitated the development of LLM-based applications. A critical factor in improving these applications is the design of effective system prompts, which significantly affect the behavior and output quality of LLMs. However, system prompts are susceptible to theft and misuse, which can undermine the interests of prompt owners. Existing methods protect prompt copyright through watermark injection and verification, but they rely on intermediate LLM outputs (e.g., logits), which limits their practical feasibility.
In this paper, we propose PromptCOS, a method for auditing prompt copyright based on content-level output similarity. It embeds a watermark by optimizing the prompt while jointly optimizing a special verification query and content-level signal marks. To make auditing reliable when only output content is available, it uses cyclic output signals and injects auxiliary tokens. It also adds cover tokens to protect the watermark from malicious deletion. For copyright verification, PromptCOS identifies unauthorized usage by measuring the similarity between the suspicious output and the signal mark. Experimental results demonstrate that our method achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% greater than the best baseline), high fidelity (accuracy degradation of no more than 0.58%), robustness (resilience against three types of potential attacks), and computational efficiency (up to 98.1% reduction in computational cost). Our code is available on GitHub at https://github.com/LianPing-cyber/PromptCOS.
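The verification step described above, comparing a suspicious output against the signal mark, can be illustrated with a minimal sketch. It is not the PromptCOS implementation: the similarity metric (`difflib.SequenceMatcher` over whitespace tokens), the function names `similarity` and `audit`, the 0.8 threshold, and the example signal mark are all assumptions chosen for illustration. The abstract does not specify any of them.

```python
from difflib import SequenceMatcher


def similarity(output: str, signal_mark: str) -> float:
    """Token-level similarity in [0, 1].

    Stand-in metric (assumption): the paper's actual content-level
    similarity measure is not given in the abstract.
    """
    return SequenceMatcher(None, output.split(), signal_mark.split()).ratio()


def audit(suspicious_output: str, signal_mark: str, threshold: float = 0.8) -> bool:
    """Flag likely unauthorized use when the output is close to the signal mark.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return similarity(suspicious_output, signal_mark) >= threshold


# Hypothetical cyclic signal mark: a short token cycle repeated three times.
signal_mark = " ".join(["alpha", "beta", "gamma"] * 3)

# Output a model might return to the verification query if it runs the
# watermarked prompt (idealized here as an exact reproduction of the mark).
watermarked_output = signal_mark

# Output from a model that does not use the watermarked prompt.
clean_output = "I am sorry, I do not understand the question."

print(audit(watermarked_output, signal_mark))
print(audit(clean_output, signal_mark))
```

Because the check needs only the returned text, it can in principle be run against a deployed application without access to logits, which is the setting the abstract targets.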