Prompt Obfuscation for Large Language Models

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the risk of intellectual-property leakage through reverse engineering of large language model (LLM) system prompts, this paper proposes a prompt obfuscation method: a semantics-preserving reparameterization of the system prompt that hides the original instructions while causing negligible functional degradation. The authors evaluate similarity between the outputs of obfuscated and original prompts along eight metrics spanning lexical, character-level, and semantic dimensions, and validate robustness with three classes of deobfuscation attacks under both black-box and white-box settings. Experiments show that obfuscated prompts remain on par with the originals across all eight metrics, that none of the three attack classes recovers meaningful instructions, and that the computational overhead is small. The work thus offers a practical defense for protecting system prompts as intellectual property.

📝 Abstract
System prompts that include detailed instructions describing the task performed by the underlying LLM can easily transform foundation models into tools and services with minimal overhead. Because of their crucial impact on utility, they are often considered intellectual property, similar to the code of a software product. However, extracting system prompts is easily possible, and as of today there is no effective countermeasure to prevent their theft: all existing safeguarding efforts can be evaded. In this work, we propose an alternative to conventional system prompts. We introduce prompt obfuscation to prevent the extraction of the system prompt with only little overhead. The core idea is to find a representation of the original system prompt that leads to the same functionality, while the obfuscated system prompt contains no information from which conclusions about the original can be drawn. We evaluate our approach by comparing the output of the obfuscated prompt with that of the original prompt, using eight distinct metrics to measure lexical, character-level, and semantic similarity. We show that the obfuscated version is consistently on par with the original one. We further perform three different deobfuscation attacks with varying attacker knowledge, covering both black-box and white-box conditions, and show that in realistic attack scenarios an attacker is not able to extract meaningful information. Overall, we demonstrate that prompt obfuscation is an effective mechanism for safeguarding the intellectual property of a system prompt while maintaining the same utility as the original prompt.
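The core idea in the abstract can be sketched as an optimization problem: search for an alternative prompt representation that makes the (frozen) model behave identically, while the found representation shares nothing with the original. The toy below is only an illustration of that formulation, not the paper's implementation: it replaces the LLM with a fixed linear map over embeddings, and all names and hyperparameters here are assumptions.

```python
import numpy as np

# Toy stand-in for the LLM: a frozen linear map from a "prompt embedding"
# to an output vector. In the paper's setting, the obfuscated prompt would
# instead be optimized in the LLM's embedding space against LLM outputs.
rng = np.random.default_rng(0)
d_prompt, d_out = 16, 8
W = rng.normal(size=(d_out, d_prompt))  # frozen model weights

def model(prompt_emb):
    """Maps a prompt embedding to the model's output (stand-in for an LLM)."""
    return W @ prompt_emb

original = rng.normal(size=d_prompt)  # the secret system prompt's embedding
target = model(original)              # the behavior we must preserve

# Start from a random point far from the original and minimize the
# output-matching loss ||model(p) - target||^2 by gradient descent.
obf = rng.normal(size=d_prompt) * 5.0
lr = 0.01
for _ in range(2000):
    grad = 2.0 * W.T @ (model(obf) - target)  # analytic gradient of the loss
    obf -= lr * grad

same_behavior = np.allclose(model(obf), target, atol=1e-3)
far_from_original = np.linalg.norm(obf - original) > 1.0
```

Because the map has a nontrivial null space (output dimension < prompt dimension), many embeddings produce the target behavior, so the optimizer converges to one that matches the outputs yet differs substantially from the original; this mirrors the paper's claim of equal utility without recoverable information about the original prompt.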
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Prompt Secrecy Protection
Unauthorized Replication Prevention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Obfuscation
Knowledge Protection
Anti-Deobfuscation Attack