🤖 AI Summary
This work identifies a novel security threat to large language models (LLMs) in in-context learning (ICL): adversarial prompt injection attacks. Unlike prior attacks, this threat requires no trigger in the user's input, is highly stealthy, and transfers across tasks. The attack uses a gradient-guided search to append malicious suffixes to demonstration examples, enabling precise hijacking of model outputs. It is the first triggerless, highly stealthy, and transferable prompt-hijacking attack targeting ICL specifically, achieving high success rates across diverse classification and jailbreaking tasks on mainstream LLMs. To counter this threat, the authors propose a lightweight defense that leverages a small set of clean demonstrations, combined with few-shot robust fine-tuning. This approach significantly improves ICL robustness, reducing error rates by over 60% under adversarial conditions.
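The gradient-guided suffix search described above can be pictured as a greedy coordinate search over suffix tokens. The sketch below is a toy illustration only, not the paper's actual method: the hand-written `hijack_score`, the toy vocabulary, and the target token IDs are all invented stand-ins for the gradient-derived target loss a real attack would compute on an LLM.

```python
import random

random.seed(0)

VOCAB = list(range(100))   # toy token vocabulary (stand-in for a real tokenizer)
TARGET = [7, 42, 3, 99]    # hypothetical target tokens the attacker wants to force

def hijack_score(suffix):
    """Toy stand-in for the (negated) loss of the attacker's target output
    given demos + suffix; higher means closer to a successful hijack."""
    return sum(1 for s, t in zip(suffix, TARGET) if s == t)

def greedy_suffix_search(length=4, steps=50, candidates=16):
    """Greedy coordinate search: each step, try candidate token swaps at
    every suffix position and keep the best-scoring suffix found."""
    suffix = [random.choice(VOCAB) for _ in range(length)]
    for _ in range(steps):
        best = (suffix, hijack_score(suffix))
        for pos in range(length):
            for tok in random.sample(VOCAB, candidates):
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                score = hijack_score(cand)
                if score > best[1]:
                    best = (cand, score)
        suffix = best[0]
        if best[1] == len(TARGET):  # toy "attack succeeded" condition
            break
    return suffix

adversarial_suffix = greedy_suffix_search()
print(adversarial_suffix)
```

In the real attack, candidate swaps would be ranked by gradients of the target-output loss with respect to the suffix token embeddings rather than sampled at random; the optimized suffix is then appended to the in-context demos.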
📝 Abstract
In-context learning (ICL) has emerged as a powerful paradigm that adapts LLMs to specific downstream tasks by supplying labeled examples as demonstrations (demos) in the preconditioned prompts. Despite its promising performance, ICL remains vulnerable: carefully crafted adversarial attacks pose a notable threat to the robustness of LLMs. Existing attacks are either easy to detect, require a trigger in the user input, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable prompt injection attack against ICL, aiming to hijack LLMs into generating the target output or eliciting harmful responses. In our threat model, the attacker acts as a model publisher who uses a gradient-based prompt search method to learn imperceptible adversarial suffixes and append them to the in-context demos via prompt injection. We also propose effective defense strategies that use a few shots of clean demos to enhance the robustness of LLMs during ICL. Extensive experimental results across various classification and jailbreak tasks demonstrate the effectiveness of the proposed attack and defense strategies. This work highlights significant security vulnerabilities of LLMs during ICL and underscores the need for further in-depth study.
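The clean-demo defense can be sketched as simple prompt construction: a few trusted demonstrations are placed ahead of the (possibly corrupted) demos so the in-context signal is dominated by uncorrupted examples. This is a minimal sketch under that assumption; the function name, demo strings, and suffix marker are hypothetical, and the paper's accompanying few-shot robust fine-tuning step is not shown.

```python
def harden_prompt(demos, clean_demos, k=2):
    """Hypothetical defense sketch: prepend k trusted clean demos to the
    possibly-poisoned demo list before building the ICL prompt.
    (Few-shot robust fine-tuning, the paper's second component, is omitted.)"""
    return clean_demos[:k] + demos

# Example: one demo carries an (illustrative) adversarial suffix "##sfx##".
prompt_demos = harden_prompt(
    demos=["review: terrible acting ##sfx## -> positive"],
    clean_demos=["review: great movie -> positive",
                 "review: boring plot -> negative"],
)
print(prompt_demos)
```

The design intuition is that hijacking relies on the adversarial demos dominating the context; diluting them with known-clean examples weakens that influence at negligible cost.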