🤖 AI Summary
This work uncovers a novel security threat in the embedding space of open-source large language models (LLMs): an adversary with full model access can directly perturb the continuous token embeddings to bypass alignment mechanisms, inducing jailbreaks and harmful outputs, and to reconstruct sensitive training data even from models that have undergone "machine unlearning." The authors introduce an embedding-space attack paradigm, formalize a threat model tailored to machine-unlearning scenarios, and design a gradient-based adversarial perturbation method alongside a framework for evaluating the robustness of forgetting. Experiments on mainstream open-source models, including Llama-2 and Falcon, show that the attack is stealthier and more effective than discrete prompt-based attacks or model fine-tuning, exposing fundamental vulnerabilities in current alignment and unlearning techniques and revealing embedding-space integrity as an overlooked yet critical attack surface.
📝 Abstract
Current research on the adversarial robustness of LLMs focuses on discrete input manipulations in natural language space, which transfer directly to closed-source models. However, this focus neglects the steady progress of open-source models. As open-source models advance in capability, ensuring their safety becomes increasingly imperative. Yet attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignment and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model for open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.
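The core idea — optimizing the continuous embedding of the input directly by gradient descent, rather than searching over discrete tokens — can be illustrated with a minimal sketch. The toy "model" below is just a fixed linear projection from embedding space to vocabulary logits; a real attack would backpropagate through the full transformer (e.g. by feeding `inputs_embeds` instead of token IDs). All names (`W`, `target_id`, `embedding_attack`) and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy stand-in for an LLM: a fixed projection W maps a d-dimensional input
# embedding to vocabulary logits. In the real attack the gradient is taken
# through the entire frozen transformer; this linear head only shows the loop.
rng = np.random.default_rng(0)
d, vocab = 16, 32
W = rng.normal(size=(d, vocab))

def loss_and_grad(e, target_id):
    """Cross-entropy of a target token, and its gradient w.r.t. the embedding e."""
    logits = e @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[target_id])
    grad = W @ (p - np.eye(vocab)[target_id])  # chain rule: d loss / d e
    return loss, grad

def embedding_attack(e0, target_id, steps=1000, lr=0.01):
    """Gradient descent directly on the continuous embedding.

    Unlike discrete prompt attacks, e is free to move to points that
    correspond to no real token -- the extra freedom that makes
    embedding-space attacks efficient against aligned models.
    """
    e = e0.copy()
    for _ in range(steps):
        _, g = loss_and_grad(e, target_id)
        e -= lr * g
    return e

e0 = rng.normal(size=d)   # embedding of some benign input
target = 7                # token the attacker wants the model to emit
loss_before, _ = loss_and_grad(e0, target)
e_adv = embedding_attack(e0, target)
loss_after, _ = loss_and_grad(e_adv, target)
```

In the full setting the same loop would run over the embeddings of an appended adversarial suffix, with the loss summed over a desired target response; only the optimization variable (continuous embeddings instead of discrete tokens) distinguishes this from prompt-space attacks.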