🤖 AI Summary
Problem: In multi-turn dialogues, malicious intent hidden across turns can elicit harmful responses from large language models (LLMs). Method: This paper proposes a multi-turn safety alignment framework featuring: (1) a thought-guided multi-turn jailbreak attack modeling mechanism that explicitly captures how adversarial intent evolves; (2) a future-reward-based reinforcement learning algorithm that lets the red-team and target models optimize against each other iteratively; and (3) a red-blue adversarial training paradigm that achieves safety alignment at the level of multi-turn interaction. Results: The proposed red-team model achieves state-of-the-art performance in multi-turn jailbreak attacks. The aligned target model is significantly more robust on benchmarks including ToxiGen and SafeBench, with a 27.4% improvement in defense success rate against stealthy multi-turn jailbreak attacks.
📝 Abstract
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden across interactions, making LLMs more prone to producing harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages. In the thought-guided attack learning stage, the red-team model learns thought-guided multi-round jailbreak attacks in order to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
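
The abstract does not spell out how the future-reward signal is computed. As a rough illustration only, one common way to credit an early dialogue turn for what happens in later turns is a discounted reward-to-go. The sketch below assumes a scalar reward per turn and a discount factor; the function name `future_rewards`, the parameter `gamma`, and the example reward values are hypothetical and are not taken from the paper.

```python
from typing import List


def future_rewards(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Discounted reward-to-go for each turn of a dialogue.

    ASSUMPTION: this is a generic reinforcement learning construct used
    for illustration, not the paper's actual algorithm.
    returns[t] = r[t] + gamma * r[t+1] + gamma**2 * r[t+2] + ...
    """
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted value
    # of everything that follows it.
    for t in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical three-turn dialogue where only the final turn is scored.
    print(future_rewards([0.0, 0.0, 1.0], gamma=0.5))
```

Under this assumption, a turn that looks harmless in isolation still receives a nonzero signal when it sets up a harmful outcome later in the conversation, which is the intuition behind scoring multi-turn attacks and defenses by future rather than immediate reward.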