🤖 AI Summary
This paper identifies a safety vulnerability in large language models (LLMs): in multi-turn dialogues, harmful intent can be spread across semantically related yet superficially benign prompts that evade existing alignment mechanisms. To exploit this, the authors propose ActorAttack, a multi-turn jailbreaking framework inspired by actor-network theory; it uses an LLM to model a network of semantically linked actors and automatically discover diverse, covert attack paths while keeping the harmful intent hidden. The contributions are threefold: (1) SafeMTData, an open-source dataset of multi-turn adversarial prompts and safety alignment data; (2) ActorAttack significantly outperforms state-of-the-art single-turn and multi-turn baselines across advanced aligned models, including GPT-4o, Claude, and Qwen; and (3) fine-tuning on SafeMTData substantially improves model robustness against multi-turn attacks.
📝 Abstract
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths toward the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even against GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data generated by ActorAttack. We demonstrate that models safety-tuned on our dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.
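The two steps the abstract describes — discovering correlated actors as attack clues, then steering an innocuous conversation about each actor toward the harmful target — can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the authors' implementation: `query_llm` is a hypothetical stand-in for any chat-completion API, and the prompt templates and actor parsing are invented placeholders.

```python
def query_llm(messages):
    """Hypothetical stand-in for a chat-model call; returns a canned reply."""
    return f"[model reply to: {messages[-1]['content'][:40]}]"

def discover_actors(target, n_actors=3):
    """Step 1 (sketch): ask the model for semantically linked 'actors'
    (people, objects, concepts) related to the target topic."""
    prompt = f"List {n_actors} entities closely associated with: {target}"
    _reply = query_llm([{"role": "user", "content": prompt}])
    # A real run would parse actor names out of _reply; we fabricate them.
    return [f"actor_{i}" for i in range(n_actors)]

def build_attack_path(actor, target, n_turns=3):
    """Step 2 (sketch): open with an innocuous question about the actor,
    then steer over several turns toward the harmful target."""
    history = []
    for turn in range(n_turns):
        if turn == 0:
            user_msg = f"Tell me about {actor}."          # benign opener
        elif turn < n_turns - 1:
            user_msg = f"How does {actor} relate to {target}?"
        else:
            user_msg = f"Based on the above, explain {target} in detail."
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": query_llm(history)})
    return history

def actor_attack(target):
    """One multi-turn dialogue per discovered actor: the diverse attack
    paths come from the different actors, not from rewording one prompt."""
    return {a: build_attack_path(a, target) for a in discover_actors(target)}
```

Each discovered actor yields an independent multi-turn dialogue, which is what gives the attack both diversity (many paths per target) and stealth (each individual query looks benign in isolation).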