MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to stealthy jailbreak attacks in multi-round dialogues, yet existing attack and defense methods predominantly target single-turn interactions. Method: This paper introduces the first automated jailbreaking agent designed specifically for multi-round dialogue. Departing from conventional single-turn paradigms, it integrates a risk decomposition strategy that dynamically distributes jailbreaking intent across multiple rounds with psychologically inspired prompting, multi-step reasoning and planning, dialogue state modeling, and reinforcement feedback-driven iterative optimization. Contribution/Results: The agent achieves state-of-the-art attack success rates across multiple mainstream LLMs, significantly outperforming both template-based and single-turn jailbreaking baselines. Its design bridges cognitive psychology and adversarial dialogue modeling, enabling more realistic and persistent adversarial behavior. The code and benchmark dataset will be publicly released to support reproducible research.
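
To make the described pipeline concrete, below is a minimal Python sketch of the multi-round control loop: a goal is split into per-round sub-queries, each round's response is scored by a judge, and low-scoring queries are refined before the dialogue advances. All names here (`decompose_goal`, `judge_response`, `refine_query`) are hypothetical placeholders, not the authors' implementation; the learned risk decomposition, psychological strategies, and reinforcement feedback of MRJ-Agent are deliberately stubbed out.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Tracks the multi-round conversation and per-round judge scores."""
    history: list = field(default_factory=list)  # (query, response) pairs
    scores: list = field(default_factory=list)   # judge score per round


def decompose_goal(goal: str, n_rounds: int) -> list:
    """Placeholder risk decomposition: spread one intent over several
    sub-queries, one per round. MRJ-Agent learns this split; stubbed here."""
    return [f"[sub-query {i + 1} of {n_rounds} toward: {goal}]" for i in range(n_rounds)]


def query_target(history: list, query: str) -> str:
    """Stub for the target LLM call; a real agent would send the full
    dialogue history plus the new query to the model under test."""
    return f"[model response to: {query}]"


def judge_response(response: str) -> float:
    """Stub judge returning a score in [0, 1]; the paper's feedback-driven
    optimization is not implemented in this placeholder."""
    return 0.0


def refine_query(query: str, response: str) -> str:
    """Stub refinement step; the paper additionally applies psychological
    strategies here, which this sketch omits."""
    return query  # unchanged in this sketch


def run_episode(goal: str, n_rounds: int = 3, max_retries: int = 2) -> DialogueState:
    """Run one multi-round episode: per round, query, score, and retry
    with refined queries until the judge accepts or retries run out."""
    state = DialogueState()
    for sub_query in decompose_goal(goal, n_rounds):
        query = sub_query
        for _ in range(max_retries + 1):
            response = query_target(state.history, query)
            score = judge_response(response)
            if score > 0.5:  # accept this round and move on
                break
            query = refine_query(query, response)  # retry with feedback
        state.history.append((query, response))
        state.scores.append(score)
    return state
```

The key design point this sketch illustrates is that the dialogue state persists across rounds, so each sub-query builds on everything the target model has already said, which is what distinguishes the multi-round setting from repeated single-turn attacks.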

📝 Abstract
Large Language Models (LLMs) demonstrate outstanding knowledge and understanding capabilities, but they have also been shown to produce illegal or unethical responses when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous work mainly focuses on jailbreaks in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Some recent studies have concentrated on the risks associated with jailbreaks in multi-round dialogues, typically using manually crafted templates or prompt engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak performance is limited. To address this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values posed by LLMs. We propose a risk decomposition strategy that distributes risk across multiple rounds of queries and uses psychological strategies to enhance attack strength. Extensive experiments show that our proposed method surpasses other attack methods and achieves a state-of-the-art attack success rate. We will release the corresponding code and dataset to support future research.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Multi-turn Dialogue
Security Risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn Dialogues
Risk Diffusion
Psychological Strategies