You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
System prompts for large language models (LLMs) are vulnerable to leakage, and existing defenses struggle against unseen attacks. Method: The authors first present a simple yet effective leakage attack that extracts system prompts even from state-of-the-art models such as GPT-4o and Claude 3.5 Sonnet. They then propose SysVec, which encodes plaintext system prompts into learnable internal vector representations, termed *system vectors*, replacing explicit contextual injection. Since no prompt text appears in the input, the leakage risk is mitigated at its root. SysVec integrates instruction tuning and safety alignment to preserve language capability and instruction-following performance while alleviating long-context forgetting. Contribution/Results: Experiments demonstrate that SysVec substantially improves robustness against diverse prompt-injection and leakage attacks while maintaining functional integrity and behavioral controllability. To the authors' knowledge, this is the first work to fully internalize system prompts as implicit, end-to-end trainable vector representations with integrated safety alignment.

📝 Abstract
Large language models (LLMs) have been widely adopted across various applications, leveraging customized system prompts for diverse tasks. Facing potential system prompt leakage risks, model developers have implemented strategies to prevent leakage, primarily by disabling LLMs from repeating their context when encountering known attack patterns. However, these defenses remain vulnerable to new and unforeseen prompt-leaking techniques. In this paper, we first introduce a simple yet effective prompt leaking attack to reveal such risks. Our attack is capable of extracting system prompts from various LLM-based applications, even from state-of-the-art models such as GPT-4o or Claude 3.5 Sonnet. Our findings further inspire us to search for a fundamental solution to the problem: having no system prompt in the context at all. To this end, we propose SysVec, a novel method that encodes system prompts as internal representation vectors rather than raw text. By doing so, SysVec minimizes the risk of unauthorized disclosure while preserving the LLM's core language capabilities. Remarkably, this approach not only enhances security but also improves the model's general instruction-following abilities. Experimental results demonstrate that SysVec effectively mitigates prompt leakage attacks, preserves the LLM's functional integrity, and helps alleviate the forgetting issue in long-context scenarios.
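The abstract describes the core idea only at a high level. As a rough illustration of "system prompts as internal vectors" (a minimal sketch, not the paper's actual implementation; the toy sizes, the `system_vectors` name, and the use of plain numpy in place of a real transformer are all assumptions), one can prepend learned vectors directly in embedding space, so no system prompt tokens ever enter the context:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, N_SYS = 100, 16, 4  # toy vocabulary/model/vector sizes (assumed)

# Frozen token-embedding table of a toy LLM.
embed_table = rng.normal(size=(VOCAB, D_MODEL))

# Learnable "system vectors": in SysVec these would be trained (with
# instruction tuning and safety alignment) to encode the system prompt's
# behavior, then prepended in embedding space instead of prompt tokens.
system_vectors = rng.normal(size=(N_SYS, D_MODEL))

def build_inputs(user_token_ids):
    """Prepend system vectors to the user message's token embeddings.

    The plaintext system prompt never appears in the token stream, so a
    prompt-leaking attack has no text to extract.
    """
    user_embeds = embed_table[np.asarray(user_token_ids)]
    return np.concatenate([system_vectors, user_embeds], axis=0)

inputs = build_inputs([5, 17, 42])
print(inputs.shape)  # (N_SYS + 3, D_MODEL) -> (7, 16)
```

The key property is that the model consumes a sequence of continuous vectors, only the trailing portion of which corresponds to actual user tokens; asking the model to "repeat its context" can at best reproduce the user's own text.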
Problem

Research questions and friction points this paper is trying to address.

Mitigating system prompt leakage risks in LLMs
Proposing vector encoding to replace text system prompts
Preserving language capabilities while preventing unauthorized disclosure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encodes system prompts as internal vectors
Minimizes unauthorized disclosure risks
Preserves language capabilities while improving security