KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

📅 2025-02-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limitations of existing LLM jailbreaking attacks (heavy reliance on manual prompt engineering, high query overhead, and insufficient prompt diversity), this paper proposes the Knowledge-Distilled Attacker (KDA), a framework for automated, low-cost jailbreak prompt generation. KDA distills the adversarial capabilities of an ensemble of state-of-the-art black-box jailbreaking methods into a single lightweight open-source model, which then generates jailbreak prompts end to end with few queries and without meticulous system prompt engineering. A quantitative prompt diversity analysis, combined with red-teaming validation, identifies diverse and ensemble attacks as the key factors behind its effectiveness. Experiments across mainstream open-source and commercial black-box LLMs show that KDA improves jailbreaking success rates while reducing average query cost by 57% and increasing prompt diversity by 2.3× compared to prior approaches.

📝 Abstract

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.
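
The abstract mentions a quantitative diversity analysis of generated prompts but does not say which metric is used. As a rough illustration only, one simple way to score the diversity of a prompt set is the mean pairwise Jaccard distance between token sets. The sketch below assumes whitespace tokenization, and the function names (`token_set`, `jaccard_distance`, `mean_pairwise_diversity`) are hypothetical; none of this is taken from the paper.

```python
from itertools import combinations


def token_set(prompt: str) -> frozenset:
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return frozenset(prompt.lower().split())


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_diversity(prompts: list) -> float:
    """Average Jaccard distance over all unordered prompt pairs.

    0.0 means every prompt uses the same vocabulary; 1.0 means no two
    prompts share a token. Fewer than two prompts yields 0.0.
    """
    sets = [token_set(p) for p in prompts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Benign placeholder strings stand in for generated prompts.
    sample = ["alpha beta gamma", "alpha beta delta", "epsilon zeta"]
    print(round(mean_pairwise_diversity(sample), 3))
```

A token-overlap score like this is cheap but only captures lexical variety; an embedding-based distance would be a natural substitute if semantic diversity is what matters.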

Problem

Research questions and friction points this paper is trying to address.

Generate diverse jailbreak prompts

Improve attack success rates

Reduce attack cost and time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation technique

Automated diverse prompt generation

Enhanced attack success rates
