NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

📅 2025-10-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) remain vulnerable to stealthy jailbreak attacks in multi-turn dialogues. This paper proposes NEXUS, a modular adversarial framework with three components: (1) ThoughtNet, a hierarchical semantic network that expands the attack's semantic space; (2) a simulation mechanism with attacker, victim, and judge agents that refines queries from feedback, scoring them jointly on harmfulness and semantic similarity; and (3) a network traverser that adaptively explores paths through the refined network at attack time. Experiments across multiple open- and closed-source LLMs show jailbreak success rates 2.1% to 19.4% higher than state-of-the-art baselines.

📝 Abstract

Large Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space poorly, rely on hand-crafted heuristics, or lack systematic query refinement. We present NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker-victim-judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline uncovers stealthy, high-success adversarial paths across LLMs. On several closed-source and open-source LLMs, NEXUS increases attack success rate by 2.1% to 19.4% over prior methods. Code: https://github.com/inspire-lab/NEXUS
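The headline result is stated as a change in attack success rate (ASR). As a minimal sketch of how such a figure is commonly computed, the Python below takes per-prompt judge verdicts and returns the fraction marked successful, plus the difference between two methods. The function names and the verdict lists are hypothetical and are not taken from the NEXUS code. The abstract does not say whether 2.1% to 19.4% is an absolute or a relative gain; this sketch assumes an absolute difference in percentage points.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of evaluated prompts that the judge marked as successful."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def asr_gain_points(new: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Absolute ASR difference in percentage points (assumed convention)."""
    return 100.0 * (attack_success_rate(new) - attack_success_rate(baseline))


# Made-up verdicts for illustration only; not results from the paper.
baseline_verdicts = [True] * 6 + [False] * 4
new_verdicts = [True] * 8 + [False] * 2

gain = asr_gain_points(new_verdicts, baseline_verdicts)
print(f"ASR gain: {gain:.1f} percentage points")
```

Under a relative convention the same verdicts would give a different number, so it is worth checking the paper's tables before comparing the reported range with other work.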

Problem

Research questions and friction points this paper is trying to address.

Exposing LLM vulnerability to multi-turn jailbreak attacks

Systematically exploring the adversarial query space

Improving attack success rates against LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

ThoughtNet hierarchically expands a harmful intent into a structured semantic network

Feedback-driven simulator refines query chains through LLM collaboration

Network traverser adaptively navigates refined query space for attacks

🔎 Similar Papers

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation