MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multi-turn dialogues pose security risks wherein latent malicious intents can elicit harmful responses from large language models (LLMs). Method: This paper proposes a multi-turn safety alignment framework featuring: (1) a novel thought-guided multi-turn jailbreak attack modeling mechanism that explicitly captures the evolution of adversarial intent; (2) a future-reward-based bidirectional reinforcement learning algorithm enabling collaborative, iterative optimization between red-team and target models; and (3) a red-blue adversarial training paradigm achieving safety alignment at the multi-turn interaction level. Results: The proposed red-team model achieves state-of-the-art performance in multi-turn jailbreak attacks. The aligned target model demonstrates significantly enhanced robustness on benchmarks including ToxiGen and SafeBench, with a 27.4% improvement in defense success rate against stealthy multi-turn jailbreak attacks.

📝 Abstract

The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework, to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Addressing hidden malicious intentions in multi-round LLM dialogues

Improving LLM security against multi-turn jailbreak attacks

Enhancing safety alignment robustness via adversarial optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn reinforcement learning for safety

Thought-guided attack learning stage

Adversarial iterative optimization between models
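
The abstract does not spell out how "future rewards" are computed, so the following is only a minimal sketch of one common reading: each turn is credited with its own reward plus a discounted sum of the rewards from later turns, so that an early response that sets up a later harmful one is penalized too. The function name `future_rewards`, the discount factor `gamma`, and the per-turn reward values are illustrative assumptions, not the paper's actual algorithm.

```python
from typing import List


def future_rewards(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Hypothetical helper: discounted reward-to-go for each dialogue turn.

    turn_rewards[t] is an assumed per-turn safety score (e.g. from a judge
    model); gamma is an assumed discount factor. Neither comes from the paper.
    """
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted later rewards.
    for t in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A three-turn dialogue where only the final turn receives a reward signal.
    print(future_rewards([0.0, 0.0, 1.0], gamma=0.5))
```

Under this reading, a reward that only appears at the last turn still propagates back to the earlier turns, which is the property a multi-turn objective needs when the harmful intent is spread across the conversation.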

🔎 Similar Papers

Cross-Modal Safety Alignment: Is textual unlearning all you need?