AI Summary
Large language models (LLMs) exhibit significant vulnerability to jailbreak attacks in multi-turn interactions. Method: This paper proposes HarmNet, an adaptive multi-turn attack framework grounded in semantic modeling and dynamic path optimization. It introduces (1) ThoughtNet, a hierarchical semantic network that explicitly models attacker intent and constraints; (2) a feedback simulator-driven iterative query optimization mechanism; and (3) a real-time adaptive network traversal strategy enabling dynamic evolution and fine-grained search of attack paths. Results: HarmNet achieves a 99.4% attack success rate on Mistral-7B, outperforming the best baseline by 13.9%. It consistently surpasses state-of-the-art methods across both open- and closed-source LLMs, demonstrating the effectiveness and generalizability of semantics-guided multi-turn attack paradigms.
Abstract
Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet consistently outperforms state-of-the-art methods in attack success rate. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline.
Index Terms: jailbreak attacks; large language models; adversarial framework; query refinement.
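The abstract describes an architecture of three interacting parts: a hierarchical semantic network of intents, a feedback simulator that scores candidate queries, and a traverser that adaptively walks the network while refining each query. A minimal sketch of how such a traverse-and-refine loop could fit together is shown below. All names and signatures here (`ThoughtNode`, `traverse_and_refine`, the `simulator` and `refine` callables) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a HarmNet-style adaptive traverse-and-refine loop.
# The real system's components (ThoughtNet, Simulator, Network Traverser)
# are stand-ins here; this only illustrates the control flow the abstract
# describes: walk the semantic network, and at each node iteratively
# rewrite the query until simulated feedback clears a threshold.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ThoughtNode:
    """One node in the hierarchical semantic network (an intent/constraint)."""
    text: str
    children: List["ThoughtNode"] = field(default_factory=list)


def traverse_and_refine(
    root: ThoughtNode,
    simulator: Callable[[str], float],    # feedback score in [0, 1]
    refine: Callable[[str, float], str],  # rewrites a query given its score
    threshold: float = 0.9,
    max_turns: int = 5,
) -> List[str]:
    """Depth-first traversal; each node's query is refined until the
    simulated feedback score reaches the threshold (or turns run out)."""
    path: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        query = node.text
        score = simulator(query)
        for _ in range(max_turns):
            if score >= threshold:
                break
            query = refine(query, score)
            score = simulator(query)
        path.append(query)
        # push children in reverse so they are visited left-to-right
        stack.extend(reversed(node.children))
    return path
```

The key design point the abstract implies is that refinement is driven by simulated feedback rather than the target model's responses alone, which lets the traversal prune low-scoring paths before committing real query turns.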