RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

📅 2025-12-08
🤖 AI Summary
Existing single-turn jailbreaking attacks fail to model long-horizon interactive strategies. To address this limitation, this work pioneers modeling multi-turn black-box large language model (LLM) jailbreaking as a multi-step reinforcement learning problem. We propose a dual-process reward mechanism grounded in intermediate harmfulness and semantic relevance, enabling the attacker model to generate progressively inducive prompts solely from black-box query feedback—without requiring white-box access or gradient information. The method is universally applicable to arbitrary closed-source LLMs. Extensive evaluation across major models—including GPT-4, Claude, and Llama variants—and benchmarks—such as AdvBench and SafeBench—demonstrates substantial improvements in attack success rate over state-of-the-art single-turn and multi-turn baselines. Results validate the approach’s effectiveness, generalizability across diverse architectures and safety evaluations, and strategic transferability across models and tasks.

📝 Abstract
Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single-turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
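The reward structure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scorer functions, weight names, and the harmfulness cap are placeholders, assuming harmfulness and relevance judges that return scores in [0, 1].

```python
# Sketch of the reward scheme from the abstract: an outcome reward
# (harmfulness of the final-turn output) plus two heuristic process
# rewards on intermediate turns. All scorers/weights are hypothetical.
from typing import Callable, List


def episode_rewards(
    outputs: List[str],                     # black-box model outputs, one per turn
    goal: str,                              # harmful behavior the attacker targets
    harmfulness: Callable[[str], float],    # judge score in [0, 1] (placeholder)
    relevance: Callable[[str, str], float], # semantic similarity in [0, 1] (placeholder)
    harm_cap: float = 0.5,                  # keep intermediate harmfulness below this
    w_harm: float = 1.0,
    w_rel: float = 1.0,
) -> List[float]:
    """Per-turn rewards: process rewards on intermediate turns,
    outcome reward (final-turn harmfulness) on the last turn."""
    rewards = []
    for t, out in enumerate(outputs):
        if t == len(outputs) - 1:
            # Outcome reward: harmfulness of the final-turn output.
            rewards.append(harmfulness(out))
        else:
            # Process reward 1: penalize intermediate outputs harmful
            # enough to risk triggering the target's refusal mechanism.
            r_harm = -max(0.0, harmfulness(out) - harm_cap)
            # Process reward 2: keep intermediate outputs on topic to
            # avoid drifting into irrelevant content.
            r_rel = relevance(out, goal)
            rewards.append(w_harm * r_harm + w_rel * r_rel)
    return rewards
```

In this sketch the process rewards only shape intermediate turns, so the final-turn signal the policy ultimately optimizes remains the outcome reward alone, matching the formulation in the abstract.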
Problem

Research questions and friction points this paper is trying to address.

Automates multi-turn jailbreaking of black-box LLMs
Optimizes long-term attack strategies using reinforcement learning
Improves success rates with heuristic process rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn reinforcement learning for jailbreak attacks
Heuristic rewards control harmfulness and relevance
Optimizes final harmful output as outcome reward
Authors
Xiqiao Xiong, University of Science and Technology of China
Ouxiang Li, University of Science and Technology of China
Zhuo Liu, University of Science and Technology of China
Moxin Li, National University of Singapore
Wentao Shi, University of Science and Technology of China
Fuli Feng, University of Science and Technology of China
Xiangnan He, University of Science and Technology of China