🤖 AI Summary
This work addresses key challenges in defending large language models (LLMs) against universal jailbreak attacks: poor generalizability, high computational overhead, and limited defense efficacy. Departing from conventional single-sample prompt optimization, it introduces JUMP, a cross-task transferable framework that jointly optimizes multiple prompts, together with its defensive counterpart, DUMP. JUMP combines gradient-guided collaborative multi-prompt optimization, task-agnostic prompt-embedding learning, adversarial prompt distillation, and defense alignment. Experiments across several mainstream LLMs show that JUMP achieves a 23.6% higher attack success rate than state-of-the-art methods and over 89% zero-shot task-transfer efficiency, while DUMP provides efficient and robust defense. Together, the two methods form a unified, co-evolutionary paradigm for prompt optimization that advances both attack and defense.
📝 Abstract
Large language models (LLMs) have developed rapidly in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which incurs high computational costs on large datasets, and less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.
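The core idea of the abstract — maintaining a shared set of prompts and refining them jointly over a whole set of instructions, rather than optimizing one adversarial input per case — can be illustrated with a toy hill-climbing sketch. Everything below is a hypothetical stand-in: `proxy_loss`, the character vocabulary, and the random mutation step are illustrative substitutes for the paper's actual model-based objective and optimization procedure.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def proxy_loss(prompt: str, instruction: str) -> float:
    """Stand-in score for a real model loss (lower is better).

    Toy objective: reward distinct prompt characters that also
    appear in the instruction.
    """
    return -sum(c in instruction for c in set(prompt))

def mutate(prompt: str) -> str:
    """Randomly replace one character of the prompt."""
    i = random.randrange(len(prompt))
    return prompt[:i] + random.choice(VOCAB) + prompt[i + 1:]

def optimize_pool(instructions, pool_size=4, prompt_len=8, steps=200, seed=0):
    """Jointly optimize a shared pool of prompts over ALL instructions.

    Each prompt is scored by its average loss across the instruction
    set, so the surviving prompts are 'universal' rather than tuned
    to any single case.
    """
    random.seed(seed)
    pool = ["".join(random.choices(VOCAB, k=prompt_len))
            for _ in range(pool_size)]

    def avg_loss(p):
        return sum(proxy_loss(p, ins) for ins in instructions) / len(instructions)

    for _ in range(steps):
        # Propose one mutation per pool member, then keep the best
        # pool_size candidates under the shared (averaged) objective.
        candidates = pool + [mutate(p) for p in pool]
        pool = sorted(candidates, key=avg_loss)[:pool_size]
    return pool

if __name__ == "__main__":
    pool = optimize_pool(["summarize this email", "translate this text"])
    print(pool)
```

Because the pool is scored against the full instruction set at every step, a prompt that helps on only one instruction is outcompeted by one that transfers across all of them — the same intuition behind training a universal multi-prompt attacker instead of a per-instance one.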