MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to semantic path attacks in multi-turn dialogues, which can lead to jailbreaks. Method: The authors propose what the summary describes as the first unified framework integrating red teaming and defense alignment. It combines Monte Carlo Tree Search (MCTS) with frame-based semantic modeling to discover context-aware adversarial paths; introduces a bidirectional mechanism that jointly improves attack coverage and proactive defense; and applies fine-grained, safety-aware fine-tuning together with dialogue state tracking for dynamic intervention. Contribution/Results: Experiments on mainstream LLMs report improvements of +32.7% in multi-turn jailbreak detection rate and +28.4% in defense success rate. The open-source implementation provides an end-to-end solution for safety evaluation and alignment in multi-turn dialogue systems.

📝 Abstract
As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-turn dialogue vulnerabilities in large language models
Preventing jailbreaks that exploit conversational context for harmful content
Enhancing safety alignment against adversarial manipulation techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

MCTS-driven tree search for attack trajectories
Frame semantics for diverse semantic exploration
Early dialogue intervention for safety alignment
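
The tree-search idea listed above can be illustrated with a generic Monte Carlo Tree Search loop using UCT selection. This is a minimal sketch on a toy problem, not the authors' MUSE-A implementation: the action set, the `toy_reward` function, the `TARGET` sequence, and all class and function names are assumptions made for illustration. In the paper's setting, the search would presumably run over dialogue turns and be scored by a model-based judge, neither of which is reproduced here.

```python
import math
import random

# Hypothetical toy objective: reward is the fraction of positions that
# match a hidden target sequence. It stands in for whatever scoring
# function a real search would use.
TARGET = ("b", "a", "c")

def toy_reward(path):
    return sum(p == t for p, t in zip(path, TARGET)) / len(TARGET)

class Node:
    def __init__(self, path, parent=None):
        self.path = path        # tuple of actions taken from the root
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total = 0.0        # sum of rewards backed up through this node

def uct_score(child, parent_visits, c=1.4):
    # Mean reward (exploitation) plus an exploration bonus.
    if child.visits == 0:
        return float("inf")
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(actions, max_depth, reward_fn, iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while len(node.path) < max_depth and len(node.children) == len(actions):
            parent_visits = node.visits
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, parent_visits))
        # 2. Expansion: add one untried action, unless at maximum depth.
        if len(node.path) < max_depth:
            untried = [a for a in actions if a not in node.children]
            action = rng.choice(untried)
            child = Node(node.path + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: complete the path with random actions.
        rollout = list(node.path)
        while len(rollout) < max_depth:
            rollout.append(rng.choice(actions))
        reward = reward_fn(tuple(rollout))
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    # Read off the most-visited path from the root.
    best, node = [], root
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        best.append(action)
    return tuple(best), root

if __name__ == "__main__":
    best_path, root = mcts(["a", "b", "c"], max_depth=3, reward_fn=toy_reward)
    print(best_path, root.visits)
```

The exploration constant `c` controls how readily the search revisits less-explored branches. A larger value favors breadth, which is the property a trajectory-diversity objective would care about.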
Authors

Siyu Yan
East China Normal University
Long Zeng
East China Normal University
Xuecheng Wu
Xi’an Jiaotong University
Chengcheng Han
Meituan | East China Normal University
NLP, KG
Kongcheng Zhang
Zhejiang University
Chong Peng
Qingdao University
Machine Learning, Computer Vision
Xuezhi Cao
Meituan
Data Mining, Knowledge Graph, LLMs
Xunliang Cai
Meituan
Chenjuan Guo
Professor, East China Normal University
Data Analytics, Machine Learning