Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

πŸ“… 2024-10-24
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 1
πŸ€– AI Summary
Existing adversarial suffix generation methods suffer from high computational overhead and low attack success rates (ASR) against well-aligned models such as Llama2 and Llama3. To address this, the paper proposes ADV-LLM, an iterative self-tuning process that turns an LLM into a generator of adversarial jailbreaking suffixes: at each iteration the model produces candidate suffixes and is refined on its own successful attacks. ADV-LLM achieves nearly 100% ASR on various open-source models and transfers strongly to closed-source ones, reaching 99% ASR on GPT-3.5 and 49% on GPT-4 despite being optimized solely on Llama3, while substantially reducing the computational cost of suffix generation. Beyond attacks, the authors note that ADV-LLM can generate large datasets of jailbreaking prompts, offering material for future safety alignment research.

πŸ“ Abstract
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
Problem

Research questions and friction points this paper is trying to address.

Existing automated jailbreak methods are computationally expensive to run
Adversarial suffixes achieve low attack success rates against well-aligned models such as Llama2 and Llama3
Attacks transfer poorly to closed-source models such as GPT-3.5 and GPT-4
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative self-tuning process that crafts an adversarial LLM with enhanced jailbreak ability
Significantly lower computational cost for generating adversarial suffixes
Nearly 100% ASR on open-source LLMs, with strong transfer to GPT-3.5 (99%) and GPT-4 (49%)
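The iterative self-tuning loop can be sketched abstractly: sample candidate suffixes from a generator, check which ones bypass a victim model's refusal, then refine the generator on its own successes. The toy Python below illustrates only that loop shape; the vocabulary, the `attack_succeeds` check, and the weight-update rule are illustrative stand-ins, not the paper's actual method (a real ADV-LLM samples from and fine-tunes an actual LLM against a real victim model).

```python
import random

# Hypothetical toy vocabulary and "success" signal; in the real system these
# would be an LLM's token space and a refusal check on a victim model's reply.
VOCAB = ["sure", "step", "guide", "refuse", "sorry", "tutorial"]
TARGET_TOKENS = {"sure", "step", "tutorial"}

def attack_succeeds(suffix):
    """Stand-in for querying a victim model and checking for a non-refusal."""
    return sum(tok in TARGET_TOKENS for tok in suffix) >= 2

def sample_suffix(weights, rng, length=3):
    """Sample one candidate suffix from the current generator distribution."""
    return rng.choices(VOCAB, weights=weights, k=length)

def self_tune(weights, successes):
    """'Fine-tune' the generator: upweight tokens seen in successful suffixes."""
    new = list(weights)
    for suffix in successes:
        for tok in suffix:
            new[VOCAB.index(tok)] += 1.0
    return new

def run(iterations=5, samples=200, seed=0):
    rng = random.Random(seed)
    weights = [1.0] * len(VOCAB)  # start from a uniform generator
    asr_history = []
    for _ in range(iterations):
        candidates = [sample_suffix(weights, rng) for _ in range(samples)]
        successes = [s for s in candidates if attack_succeeds(s)]
        asr_history.append(len(successes) / samples)
        weights = self_tune(weights, successes)  # the self-tuning step
    return asr_history

history = run()
print(history)  # the measured ASR should trend upward across iterations
```

The key design point mirrored here is that no gradient signal from the victim is needed at generation time: the generator improves purely by retraining on its own successful outputs, which is what makes each subsequent round of suffix sampling cheap.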