🤖 AI Summary
Large language models (LLMs) are vulnerable to semantic path attacks in multi-turn dialogues, which can lead to jailbreaks that bypass safety measures. Method: We propose the first unified framework integrating red-teaming and defense alignment. It combines Monte Carlo Tree Search (MCTS) with frame-based semantic modeling to enable context-aware adversarial path discovery; introduces a bidirectional mechanism to jointly improve attack coverage and defense proactiveness; and employs fine-grained safety-aware fine-tuning alongside dialogue state tracking for dynamic intervention. Contribution/Results: Extensive experiments on mainstream LLMs demonstrate significant improvements: +32.7% in multi-turn jailbreaking detection rate and +28.4% in defense success rate. The open-sourced implementation provides an end-to-end solution for safety evaluation and alignment in multi-turn dialogue systems.
📝 Abstract
As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.